Stop-Consonant Recognition: Release Bursts and Formant Transitions as
Functionally Equivalent, Context-Dependent Cues 

M. F. Dorman,* M. Studdert-Kennedy, and L. J. Raphael



ABSTRACT 

Three experiments studied the roles of release bursts and for-
mant transitions as acoustic cues to place of articulation in sylla-
ble-initial voiced stop consonants. Experiments I and II assessed 
the weight of these cues by systematically removing them from American 
English /b,d,g/, spoken before nine different vowels by two speakers.
Experiment III assessed the functional invariance of the release burst 
by transposing it from the nine syllables of speaker 2 across all 
eight vowels for each class of stop consonant. The results showed 
that labial and apical bursts were largely invariant in their effect 
before all vowels; velar bursts before front vowels and velar bursts 
before central-back vowels were also invariant within their set. 
However, release bursts carried significant perceptual weight in only 
one syllable out of 27 for speaker 1, in only 13 syllables out of
27 for speaker 2. For speaker 2 labial and velar bursts carried sig- 
nificant weight primarily before central-back, rounded vowels, apical 
bursts primarily before high, front, unrounded vowels. Furthermore, 
burst and transition tended to be reciprocally related: where the 
perceptual weight of one increased, the weight of the other declined. 
They were thus shown to be functionally equivalent, context-dependent 
cues, each contributing to the rapid spectral changes that follow con- 
sonantal release. The results were interpreted as pointing to the 
important role played by the front-cavity resonance in signaling 
place of articulation. 
INTRODUCTION

The present paper deals with an aspect of the problem of perceptual con- 
stancy — the invariance problem — in speech recognition. At the level of pho- 
neme recognition the problem is manifest in the variety of acoustic signals 
that may be categorized as the same phoneme. This variability arises from 
several sources. For a fuller discussion than can be given here, see Studdert- 
Kennedy (1974). 

One source is differences among speakers' vocal-tract dimensions. Since 
the area function of the vocal tract determines the resonant (formant) fre- 
quencies by which a particular phoneme is cued, the formant patterns of sig- 
nals produced by a child may be quite different from those produced by an 
adult: in fact, formant frequencies for a given vowel may differ by as much 
as 30 percent. Moreover, the formant frequencies for a child often approxi- 
mate those of a different vowel spoken by an adult. Even within a single 
speaker, several sources contribute to vowel variability. Lindblom (1963),
for example, found that formant frequencies may vary by a factor of 2.3:1, 
depending on whether the vowel is spoken in isolation or in consonantal con- 
text. Moreover, in rapid speech the tongue often does not reach the articu- 
latory "targets" achieved in deliberate speech, so that vowel formant fre- 
quencies tend to be "reduced." 

Phonetic context and rate also alter the acoustic cues for consonants. 
As an example of the effects of context, syllable-initial /b/ before the vowel
/a/ is characterized by an upward spectral change, syllable-final /b/ follow-
ing /a/ by a downward spectral change. For a second example, the voiced-
voiceless distinction in stop consonants (/b/ vs. /p/, /d/ vs. /t/, /g/ vs.
/k/) is cued primarily by voice onset time (VOT) (the interval between con- 
sonantal release and the onset of phonation) in dissyllables with stress on 
the second syllable; by the intersyllable interval in dissyllables with an 
unstressed second syllable; by vowel duration in syllable-final stops when 
unreleased, and by the spectrum of the release burst in syllable-final stops 
when released. As an example of the effects of rate, the VOT distributions
of voiced and voiceless English stops do not overlap if the stops are spoken
in citation form, but may overlap considerably if the stops are spoken in
sentence context (Lisker and Abramson, 1967).

In the present paper we are concerned with yet another aspect of the
invariance problem—the variation in acoustic cues for a given stop consonant 
as a function of the following vowel. Many studies have demonstrated that 
formant transitions are generally sufficient cues for stop-consonant recogni- 
tion. Since the shape of these transitions varies with the following vowel, 
accounts of stop-consonant recognition have generally emphasized the role of 
context-conditioned cues (perhaps relational invariants) within the consonant- 
vowel syllable. Recently, however, Cole and Scott (1974a, 1974b) have sug-
gested that stop consonants before different vowels may be recognized in
terms of a context-independent acoustic cue (or simple invariant), namely, 
the burst produced at the release of stop-consonant occlusion. 

In the following experiments we explore these cues in some detail with 
natural speech. We assess, first, the extent to which separable components 
of the complex of acoustic cues for initial, voiced stop consonants (the re-
lease burst, the devoiced, and the voiced formant transitions) are sufficient
cues for the perception of place of articulation. We ask, second, whether one 
of the components — the burst — is an invariant cue for stop-consonant recogni- 
tion. Finally, we discuss the implications of our results for an account of 
stop-consonant recognition. 

Acoustic Segmentation of Stop-Consonant-Vowel Syllables 

Acoustic analysis of /bV, dV, gV/ syllables reveals five qualitatively
distinct segments before a stable vowel formant pattern is reached (cf. Fischer-
Jørgensen, 1954, 1972; Halle, Hughes, and Radley, 1957; Fant, 1969): (1) a
period of occlusion (usually silent, though occasionally voiced); (2) a transient 
explosion (usually less than 20 msec) produced by shock excitation of the vocal 
tract upon release of occlusion; (3) a very brief (0-10 msec) period of frica- 
tion, as articulators separate and air is blown through a narrow (though widen- 
ing) constriction, as in the homorganic fricative; (4) a brief period (2-20 
msec) of aspiration, within which may be detected noise-excited formant transi-
tions, reflecting shifts in vocal-tract resonances as the main body of the
tongue moves toward a position appropriate for the following vowel; (5) voiced 
formant transitions, reflecting the final stages of tongue movement into the 
vowel during the first few cycles of laryngeal vibration. Since we are only 
concerned with stop consonants in the present study, we shall not consider the 
role of the first segment (occlusion) which serves to distinguish stops from 
vowels and other consonants. Furthermore, since the explosion and frication,
even if separable on an oscillogram or spectrogram, are probably not discrimin- 
able by ear, we shall treat them in what follows as a single burst of energy, 
lasting some 2-30 msec. 

The fourth segment (aspirated or devoiced formant transition), although
usually distinguishable on an oscillogram with a high resolution time scale, is
not always readily apparent on a spectrogram (see Figure 1). Investigators have
therefore tended to discount it as an acoustic cue¹ and to concentrate attention
on the burst and on the voiced formant transition. The present paper attempts
to redress the balance by treating this segment as a separable component of the 
cue complex. 

Bursts and Transitions as Cues for Stop Consonants 

Research with synthetic speech has revealed that both bursts and voiced 
formant transitions may serve as separate cues to place of articulation of ini- 
tial /b,d,g/. Many studies have shown that transitions of the second and third 
formants are sufficient cues for the place distinction (for example, Liberman, 
Cooper, Delattre, and Gerstman, 1954; Delattre, Liberman, and Cooper, 1955), and 
these are, in fact, the standard cues used in speech synthesis. It is important 
to note that — since the acoustic shape of formant transitions varies as a func- 
tion of the following vowel — formant transitions are necessarily context- 



¹Voiceless transitions have been given due weight in studies of voiceless stops
(Liberman, Delattre, and Cooper, 1958) and fricatives (Harris, 1958).
Figure 1: A spectrogram of the syllable /gʌd/, spoken by speaker 2 (top).
An oscillogram of the same utterance is shown at the bottom:
burst (a-b) and aspiration (b-c) duration and the onset of voicing
(c) are indicated by vertical lines. Horizontal axis: time (scale bar, 100 msec).



dependent cues for stop consonants. The same is true of velar bursts. Hoffman 
(1958) found that while bursts centered at frequencies above 3000 Hz acted as 
cues for /d/, burst cues for /g/ lay near the second formant of the vowel and 
were therefore context-dependent (cf. Liberman, Delattre, and Cooper, 1952).
Hoffman could find no burst that would serve as a powerful cue for /b/, but this 
may have reflected, in part, the deficiencies of his synthesizer, rather than of 
natural speech. 

In fact, attention has recently turned to the question of how cues isolated 
in synthetic speech experiments act and interact in naturally produced speech. 
With respect to the voiced stop consonants, Cole and Scott (1974b) have
questioned the role of the formant transitions in carrying phonetic information.
These authors, following Day (1970) and Liberman, Mattingly, and Turvey (1972), 
have suggested that a major role of context-dependent formant transitions is to 
provide information about the temporal order of the segments in the speech sig-
nal. Cole and Scott (1974b) go further to suggest that phonetic information is
carried primarily by a simple invariant cue, and that for /b,d,g/ the invariant
place cue lies in the initial noise energy (burst and aspiration) before the
onset of laryngeal vibration. 

The latter claim drew apparent support from a recent experiment by Cole and 
Scott (1974a). Using a tape-splicing procedure to remove formant transitions
from /bi,bu,di,du,gi,gu/, thus leaving burst and aspiration followed by steady-
state vowel, Cole and Scott found that recognition of the syllables remained
essentially unimpaired. Moreover, when the initial energy from /bi/ was trans- 
posed to /u/, or the initial energy from /bu/ was transposed to /i/, recognition 
was again unimpaired. This relation was also reported for /di/ and /du/. How- 
ever, for the /gi/ to /u/ transposition, 90 percent /b/ responses were reported. 
The /gu/ to /i/ transposition fared better with 82 percent correct responses. 
Cole and Scott (1974a:101) concluded that "stop consonants may be recognized be-
fore different vowels... in terms of invariant acoustic features."

Implicit in this conclusion is the assumption that bursts are not only in- 
variant, but sufficient cues to place of articulation. For if they are not suf- 
ficient, it matters little whether or not they are invariant. However, it has 
been known for a number of years both from synthesis experiments (Liberman,
Delattre, and Cooper, 1952; Hoffman, 1958) and from the acoustic analysis of
natural speech (Fischer-Jørgensen, 1954, 1972; Halle, Hughes, and Radley, 1957;
Fant, 1969) that, while release burst spectra vary systematically with the fol- 
lowing vowel for initial velar stops, they are largely invariant for initial 
labial and apical stops. The most novel aspect of Cole and Scott's (1974a) 
conclusion is, therefore, that burst cues are sufficient for recognition of stop-
consonant place of articulation. Several considerations suggest that this
claim may merit closer scrutiny.

First, Cole and Scott (1974a) made no attempt to separate the release burst 
from the context-conditioned voiceless aspiration. If we examine the spectro- 
grams of Figure 2 in Cole and Scott (1974a: 104), we see obvious acoustic differ- 
ences between the transposed portions of syllable pairs. Had listeners been 
asked to identify the vowels of these transposed portions, they might well have 
been able to do so, thus demonstrating that the experimenters had transposed
not consonants, but whispered consonant-vowel (CV) syllables. In fact, Winitz, 
Scheib, and Reeds (1972), in an experiment closely related to that of Cole and 
Scott (1974a), have reported precisely this result for the (admittedly longer) 
burst and aspiration portions of initial /p,t,k/. 

A second reason to question Cole and Scott's conclusion is that they trans- 
posed energy for the voiced stops between only two vowels. Since most dialects 
of English contain approximately 16 distinctive vowel nuclei, transpositions 
over two vowels represent a rather meager test of their hypothesis. Indeed, 
Fischer-Jørgensen (1972) has shown for Danish /b,d,g/ that bursts are effective
cues for /b/ and /g/ before /i/ and /u/, but not before /a/, while for /d/ a 
burst is an effective cue before /i/, but not before /a/ or /u/. Thus, there 
is already evidence from a language other than English that bursts are not ade-
quate cues for the distinction among these stop consonants in certain vowel en-
vironments.
A third consideration is that the release bursts, claimed by Cole and
Scott (1974a) as sufficient and invariant cues, have not proved to be sufficient
for automatic speech recognition. If these cues were indeed sufficient and in- 
variant, it would be a simple enough matter to specify their acoustic values and 
build the appropriate filters into a speech recognition device. In practice, 
this has not been done, partly because in natural speech, release bursts are 
absent from stops in unstressed syllables and from syllable-final stops in all
syllables at least as frequently as they are present in the stressed syllables
to which Cole and Scott gave their attention.

A final, and perhaps the most important consideration, is that articulatory 
gestures associated with initial labial, apical, and palatovelar stop consonants 
before a variety of different vowels give rise to systematic variations in syl-
labic acoustic structure that make the hypothesis of any single sufficient cue
(whether burst or transition) across all environments extremely unlikely. 
Every researcher who has worked on speech synthesis is familiar with the fact 
that a "good" rendering of a particular phonetic segment may require different 
acoustic patterns in different phonetic environments. For example, good initial, 
voiced apical stops are more readily synthesized with a burst before high, front 
vowels, but with extensive voiced formant transitions before back vowels. Fur- 
thermore, even though isolated cues may serve a valid experimental function, 
natural speech typically displays a complex of cues with varying acoustic 
salience and therefore, we may suspect, varying perceptual weight in different 
environments. It will simplify the description and interpretation of our exper-
imental results if we here spell out the most important acoustic variations and
some possible perceptual consequences. For more detail than we can give here, 
the reader is referred to Fischer-Jørgensen (1954, 1972), Halle, Hughes, and
Radley (1957), Fant (1959, 1960, 1969), Flanagan (1972), Heinz (1974) and Klatt 
(1975). 

Release burst energy. The energy (duration x intensity) in the transient
release and its following frication varies as a function of several factors, in-
cluding the cross-sectional area of the constriction just after release, the
resonant cavity in front of the point of release and perhaps, the release ges- 
ture itself. Thus, /b/ for which there is essentially no front cavity and for 
which the release gesture is rapid (Fujimura, 1961; Kuehn, 1973) usually dis- 
plays a weak transient and virtually no frication, while /g/ for which the cross- 
sectional area between tongue and palate is relatively large, for which the front 
cavity is narrowly tuned and for which tongue release is relatively slow, dis- 
plays the longest burst of the three stops, including, on occasion, as Fischer-
Jørgensen (1954) noted, a "double" release transient (see Figure 1) [perhaps due
to a suction effect (Fant, 1969)]. Burst energy for /d/, with a smaller cross-
sectional area between tongue and alveolar ridge and a more broadly tuned front
cavity than for /g/, but with a release velocity roughly the same as for /b/,
falls midway. We might then predict increasing energy in (and therefore per-
ceptual importance of) the burst as the point of occlusion moves back in the
mouth.

Cutting across all three places of articulation, however, are possible vari-
ations in burst energy due to coarticulation with the following vowel. A major
contrast is between front unrounded vowels, such as /i,ɪ,ɛ/, and center-to-back
rounded vowels, such as /ɝ,o,u/. For /b/, increased cross-sectional area of the
constriction just after release may give rise to a longer, and so more effective,
release burst before rounded than before unrounded vowels. For /d/, elongation
of the front cavity before rounded vowels is likely to yield lower burst 
intensity than before unrounded vowels. For /g/ the effect of front cavity
elongation before rounded vowels may be counteracted by increased cross-sectional 
area of the palato-lingual constriction and narrower front-cavity tuning, than 
before unrounded vowels. Thus, if we assume that acoustic energy at least par- 
tially determines auditory salience and perceptual weight, we might expect the 
release burst to play a more important role before rounded than before unrounded 
vowels for /b/ and /g/, but exactly the reverse for /d/. 

Release burst spectrum. Spectral sections taken through the release burst
of /b/ in nine vocalic environments show a broad curve with peaks over low fre- 
quencies, below approximately 2000 Hz (see Figure 2); the low frequency peaks 
tend to be stronger before rounded than before unrounded vowels. For /d/ the 
spectral curve is broad and of a relatively high intensity with peaks over 
higher frequencies, above approximately 2000 Hz (see Figure 2); the peaks tend 
to shift upward before unrounded vowels and to be somewhat stronger than before 
rounded vowels. Apart from these minor rounding dependencies, /b/ and /d/ 
bursts are relatively unaffected by the following vowel. We may note, however, 
that these bursts do not occupy invariant positions on the frequency scale in 
relation to their following vowels; the apical burst is spectrally continuous 
with F2/F3 of the high front vowels, but spectrally distinct from F2 of the back
rounded vowels; for the labial burst these relations tend to be reversed. The 
spectrum of the velar burst, on the other hand, is narrow and of a relatively 
high intensity with its main peak close to F3 of a following front vowel, and 
close to F2 of the following back vowel, reflecting the changes from the front 
articulation of /gi/ to the back articulation of /gu/. Thus, while labial and 
apical bursts are largely invariant on the frequency scale, but variable in re- 
lation to following vowel, velar bursts are more or less invariant in relation 
to the following vowel, but variable on the frequency scale. [For a more com- 
prehensive description of burst spectra in different vocalic environments, see 
Zue† (1976).] The possible perceptual implications of these facts will become
clear when we report our results. 

Formant-transition range and energy. At least three articulatory factors
underlie variations in formant-transition structure. First are variations in
the extent of transitions as a function of place of articulation and following 
vowel. For bilabials, transitions are longer (and so, presumably more effective 
cues) before unrounded than before rounded vowels. For apical stops the distance 
between point of occlusion and vowel-target configuration varies, so that we 
might expect both devoiced and voiced transitions to be more effective cues to 
/d/ before back vowels, where transitions are relatively long, than before front 
vowels, where they are relatively short. Finally, for velars the determining 
factor is degree of similarity between the velar tongue constriction and that 
of the following vowel; in general, close vowels (such as /i/) will have rela-
tively little transition, and open vowels (such as /a/), a more marked transition.

A second factor affecting formant-transition structure is the onset of
voicing relative to onset of the release burst [i.e., VOT (Lisker and Abramson,
1964)]. An increase in the time taken for consonantal release (i.e., in release
burst duration) leads to an increase in the time taken for development of a 
transglottal pressure drop sufficient to initiate voicing, and so to an in- 
crease in VOT. If VOT is increased, transitions into the following vowel may 
be largely complete at voicing onset, so that the duration of devoiced 
transitions relative to voiced transitions is increased. Since release burst 
duration (and so VOT) typically increases from labial to apical to velar points 

†Acoustic Characteristics of Stop Consonants: A Controlled Study, by Victor
Waite Zue, Ph.D. thesis, M.I.T., May 1976.
Figure 2: Spectra of the bursts from syllable-initial /b,d,g/ spoken before
nine vowels by speaker 2. The velar spectra have been divided into
front and center-back vowel series.
of articulation (Lisker and Abramson, 1964, Table 1), we may reasonably predict
corresponding increases in the perceptual weight attached to devoiced transi-
tions.

Finally, speakers differ in vocal-tract shape and dimensions, as well as in 
articulatory habits (Bell-Berti, 1975), and even two phonetically identical ut- 
terances of the same speaker are probably never identical acoustically. If we 
add chance variations in relative effectiveness of bursts and transitions, due 
to such factors as distance between speaker and listener (or between speaker and 
microphone), we must conclude that predictions of the perceptual weight attached 
to the several acoustic cues to place of articulation can be, at best, statisti- 
cal, and that the likelihood of any single cue being the sole determinant of 
the percept in all contexts is extremely low. 

As will be seen, the results of the following three experiments support
this conclusion. Experiment I assesses the role of bursts and formant transi-
tions in the recognition of natural speech by systematically removing them from
American English /b,d,g/ spoken before nine different vowels by a single speaker.
Experiment II replicates Experiment I with a different speaker. These two ex-
periments are thus concerned with whether the manipulated cues are sufficient
for recognition. Experiment III, on the other hand, is concerned with whether 
the release burst is functionally invariant; it assesses the invariant cue value
of the release burst for the second speaker by transposing it from each conso- 
nant-vowel-consonant (CVC) syllable across all vowels for each class of stop 
consonant. 

EXPERIMENTS I AND II 

Experiment I 

Nine CVC syllables were recorded by a male speaker in a carrier phrase, 
"The little CVC dog," with stress on the CVC. Two tokens of all combinations of 
initial /b,d,g/, followed by /i,ɪ,ɛ,æ,ʌ,a,o,u,ɝ/, with a constant syllable-final
/d/ were recorded. In addition, phrases of the type, "The little VC dog" ("The
little vowel-consonant dog") were recorded, where V was again one of the nine
vowels above and C was again /d/. The phrases were digitized with an effective 
frequency response of 160-7000 Hz, by means of the Haskins Laboratories pulse 
code modulation system (Cooper and Mattingly, 1969), and the test syllables were 
excised and edited. Two parallel sets of 45 experimental signals were then con-
structed from the oscillograms by the following steps:

1. Each syllable was left in original form. 

2. From each CVC the burst was removed. A burst was defined as an utter-
ance-initial, high amplitude (relative to the surrounding signal) component of
the signal (see Figure 1). The duration of the burst was determined for each 
syllable on a high resolution oscillogram; the values were quite consistent 
across the two tokens of each syllable. Table 1 lists these durations averaged 
across tokens. The mean burst duration for /b/ was 4.3 msec, for /d/ 6.3 msec, 
for /g/ 11.7 msec. 

3. Each burst was attached to its corresponding VC syllable (for example, 
the /bid/ burst was attached to /id/), leaving a silent interval between the 
end of the burst and the first voiced pulse of the vowel, equal in duration to 
TABLE 1: Release burst and "aspiration" durations, and voice onset timesᵃ in
milliseconds, for /b,d,g/ followed by nine vowels for two speakers.ᵇ

                      Speaker 1                     Speaker 2
Syllable        Burst  Aspiration    VOT      Burst  Aspiration    VOT

/bid/               4           5      9          6           4     10
/bɪd/               5           1      6          9           9     15
/bɛd/               5           5     10          9           4     13
/bæd/               3           2      5          7           8     15
/bʌd/               5           2      7          9           4     13
/bad/               3           1      4         11           4     15
/bod/               3           6      9          6           6     12
/bud/               7           4     11         10          14     24
/bɝd/               4           5      9         10          13     23
Mean              4.3         3.4    7.7        8.6         7.3   16.0

/did/               7           1      8         25          12     37
/dɪd/               7           5     12         15           6     21
/dɛd/               6           7     13         12          13     25
/dæd/               8           6     14         13           8     21
/dʌd/               6           6     12         10           5     15
/dad/               5           6     11          7           8     15
/dod/               5           7     12          8           7     15
/dud/               6           6     12          5          10     15
/dɝd/               7           7     14         10          15     25
Mean              6.3         5.7   12.0       11.7         9.3   21.0

/gid/               7          12     19         25          10     35
/gɪd/              21           3     24         17          18     35
/gɛd/               7          11     18         22          14     36
/gæd/              12           6     18         29           7     36
/gʌd/              14           9     23         18          13     31
/gad/              11           6     17         21          25     46
/god/              13           7     20         20          15     35
/gud/               8          11     19         20          15     35
/gɝd/              12          11     23         20          21     41
Mean             11.7         8.4   20.1       21.3        15.3   36.7

ᵃVoice onset time (VOT) is the sum of release burst, affrication, and aspiration
durations.

ᵇThe values for speaker 1 are the averages of two tokens of each syllable; for
speaker 2, of one token of each syllable.
the interval between burst offset and voicing onset in the CVC from which the
burst had been removed.

4. For each CVC the entire signal up to the first well-defined voicing
pulse was removed. Thus, the burst and devoiced formants (i.e., noise-excited
resonances) were removed (see Figure 1), and the duration of this segment was
measured on an oscillogram of each utterance. Table 1 lists the two-token aver-
ages of the devoiced formants ("aspiration"), as well as of the entire segment
from burst onset to voice onset (VOT) for each syllable. Mean VOT for /b/ was
7.7 msec, for /d/ 12.0 msec, for /g/ 20.1 msec.

5. Each burst-plus-devoiced formants was then attached to its correspond-
ing VC syllable.

This procedure permitted us to present five different combinations of the
cues to place of articulation (burst, devoiced transition, voiced transition)
for each syllable: (a) all three together in the original syllable; (b) burst
plus vowel; (c) burst and devoiced transitions plus vowel; (d) voiced transitions
plus vowel; (e) devoiced and voiced transitions plus vowel.
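
The five cue combinations can be pictured as simple splicing operations on a digitized waveform. The following Python sketch is illustrative only and is not the editing system used at Haskins Laboratories: the `Landmarks` class, the function name, and the representation of a syllable as a list of samples are all assumptions, and the CVC is assumed to begin at burst onset.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Landmarks:
    """Hypothetical sample indices read off the oscillogram of one CVC."""
    burst_end: int      # first sample after the release burst
    voicing_onset: int  # sample of the first well-defined voicing pulse


def build_conditions(cvc: Sequence[float], vc: Sequence[float],
                     marks: Landmarks) -> Dict[str, List[float]]:
    """Return the five cue combinations (a)-(e) for one syllable.

    Assumes `cvc` starts at burst onset and `vc` is the matching
    vowel-consonant syllable recorded without an initial stop.
    """
    if not 0 <= marks.burst_end <= marks.voicing_onset <= len(cvc):
        raise ValueError("landmarks must be ordered and lie inside the CVC")
    burst = list(cvc[:marks.burst_end])
    devoiced = list(cvc[marks.burst_end:marks.voicing_onset])
    voiced = list(cvc[marks.voicing_onset:])  # voiced transition, vowel, final /d/
    silence = [0.0] * len(devoiced)           # gap as long as the removed aspiration
    vowel = list(vc)
    return {
        "a_original": list(cvc),
        "b_burst_vowel": burst + silence + vowel,            # step 3
        "c_burst_devoiced_vowel": burst + devoiced + vowel,  # step 5
        "d_voiced_transition": voiced,                       # step 4
        "e_devoiced_voiced_transition": devoiced + voiced,   # step 2
    }
```

With landmarks measured for each of the 27 syllables, one call per syllable would yield the full stimulus set for a speaker.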

Three recordings of each of the 45 signals in each set were generated and
randomized into two parallel test sequences of 135 items each. One test was
administered to 14 Lehman College undergraduates. The stimuli were played at
a comfortable level in a sound attenuated room, on a Revox 1122 tape recorder,
over an audiometric loudspeaker. The other test was administered to nine stu-
dent and faculty volunteers from Yale University: the stimuli were played at
a comfortable level in a sound attenuated room on an Ampex AG 400 tape recorder
over an AR4x loudspeaker at Haskins Laboratories.
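
The arithmetic of test construction (45 signals, three recordings each, 135 items) can be sketched as follows. This is a minimal illustration, not the original randomization procedure, which the text does not specify; the function name, the seed, and the integer signal identifiers are assumptions.

```python
import random
from typing import List, Sequence


def make_test_sequence(signal_ids: Sequence[int], repetitions: int = 3,
                       seed: int = 0) -> List[int]:
    """Repeat every signal `repetitions` times and shuffle the result."""
    items = [s for s in signal_ids for _ in range(repetitions)]
    rng = random.Random(seed)  # fixed seed so the sequence is reproducible
    rng.shuffle(items)
    return items


# Hypothetical identifiers 0..44 stand in for the 45 signals of one set.
sequence = make_test_sequence(range(45))
```

A plain shuffle places no constraint on adjacent items; a real test tape might additionally avoid immediate repetitions of the same signal.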

The listeners were instructed to write the identity of the initial sound
of each syllable. The response categories listed on the answer sheets were
/b,d,g,p,t,k,?,Ø/.² The ? response was for use when the listener thought that
the syllable began with a consonant, but could not decide which one. The Ø re-
sponse was for use when the listener thought that the syllable began with a
vowel. Twenty tokens of the stimuli were played to familiarize the listeners
with the task. The listeners were then presented with one of the 135-item test
sequences.

Experiment II

Exactly the same procedures of stimulus and test construction as those de-
scribed above were followed for a second speaker, except that he provided only
one token of each syllable and therefore only one test. The sentences were read



²A relatively open response set provides a sensitive measure of how "stoplike" a
signal sounds. In a situation where only /b,d,g/ are permitted as responses,
the identifiability of the signals may be overestimated. For example, a signal
composed of labial burst and a steady-state vowel such as /i/ sounds like a
click followed by /i/. However, if only /b,d,g/ are permitted as responses,
then a subject may well feel that, since the click does not sound like a high-
frequency alveolar burst, and is not affricated like a velar burst, (s)he should
respond /b/. A correct /b/ response would then be made to a signal that does
not sound like /b/.
at a very deliberate rate with stress on the initial consonant of the CVC.
Table 1 lists the durations in msec of burst, "aspiration," and VOT for each syl-
lable. The durations are very much longer than (almost double) those of
speaker 1. However, the pattern of increase in burst and VOT durations, from 
labial to apical to velar stops, is similar to that of speaker 1. 
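
The comparison between speakers can be checked directly from the release-burst columns of Table 1. The sketch below transcribes those durations and computes the per-consonant means and the speaker 2 to speaker 1 ratio; the dictionary layout and function names are assumptions introduced for illustration.

```python
from statistics import mean
from typing import Dict

# Release-burst durations in msec, transcribed from Table 1 (nine vowels each).
BURST_MS = {
    "speaker1": {
        "b": [4, 5, 5, 3, 5, 3, 3, 7, 4],
        "d": [7, 7, 6, 8, 6, 5, 5, 6, 7],
        "g": [7, 21, 7, 12, 14, 11, 13, 8, 12],
    },
    "speaker2": {
        "b": [6, 9, 9, 7, 9, 11, 6, 10, 10],
        "d": [25, 15, 12, 13, 10, 7, 8, 5, 10],
        "g": [25, 17, 22, 29, 18, 21, 20, 20, 20],
    },
}


def mean_burst(speaker: str) -> Dict[str, float]:
    """Mean burst duration per consonant for one speaker."""
    return {c: mean(v) for c, v in BURST_MS[speaker].items()}


def speaker_ratio() -> Dict[str, float]:
    """Ratio of speaker 2 to speaker 1 mean burst duration per consonant."""
    s1, s2 = mean_burst("speaker1"), mean_burst("speaker2")
    return {c: s2[c] / s1[c] for c in s1}
```

On these values the means rise from /b/ to /d/ to /g/ for both speakers, and each ratio falls between about 1.8 and 2.0, in line with the "almost double" description.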

Eleven Lehman College undergraduates took the test under conditions identi- 
cal to those of the Lehman College students in Experiment I. 



RESULTS 



Experiment I (Speaker 1) 

The two groups of subjects gave very similar results on the two parallel 
tests. We have therefore combined their data. Figure 3 displays percentage 
correct identification of initial consonantal place of articulation as a func- 
tion of vowel nucleus for the five sets of cue combinations (all cues, burst 
plus vowel, burst and voiceless transition plus vowel, voiced transition plus 
vowel, voiced and voiceless transition plus vowel) and the three classes of con-
sonant (labial, apical, velar). Responses were scored for place of articulation
only, and voicing errors were disregarded. Each data point is based on 69 re- 
sponses (23 subjects x 3 repetitions). The vowels have been ordered along the 
horizontal axis to trace a rough path around the rim of the English vowel loop 
from /i/ through /a/ to /u/, with /ɝ/ appended. The points have been connected
by straight lines to facilitate reading of the graphs. 
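
The scoring rule (place of articulation only, voicing errors disregarded) can be stated compactly. The following sketch is one possible reading of that rule, not the authors' scoring procedure: the mapping table and function name are assumptions, and the "?" and vowel-onset responses are treated as incorrect.

```python
from typing import Iterable

# Place of articulation for each stop response; voicing is ignored by design.
PLACE_OF = {
    "b": "labial", "p": "labial",
    "d": "apical", "t": "apical",
    "g": "velar", "k": "velar",
}


def percent_correct_place(intended: str, responses: Iterable[str]) -> float:
    """Percentage of responses matching the intended stop in place only.

    Responses outside PLACE_OF (for example "?" or the vowel-onset
    category) are counted as errors.
    """
    target = PLACE_OF[intended]
    scored = list(responses)
    if not scored:
        raise ValueError("at least one response is required")
    hits = sum(1 for r in scored if PLACE_OF.get(r) == target)
    return 100.0 * hits / len(scored)
```

Under this rule a /p/ response to an intended /b/ counts as correct, whereas a /d/ or "?" response does not.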

Labial. All the original syllables, except /bud/ (85 percent), were cor-
rectly identified more than 90 percent of the time. The burst was relatively
ineffective as a cue and performance hovered around chance (20 percent) before
all vowels, except /u/ (81 percent) and /ɝ/ (51 percent). The voiced transi-
tion, on the other hand, served almost as well as the full syllable and perfor-
mance hovered around 90 percent before all vowels, except /o/ (84 percent), /u/
(63 percent), and /ɝ/ (61 percent), the last two vowels being precisely those
for which burst performance was at its best. The addition of the devoiced
transition, whether to burst or voiced transition, tended to increase perfor- 
mance by a few percentage points, but this cue clearly carried little perceptual 
weight. 

Apical. All the original syllables, except /did/ (87 percent), /ded/ (74
percent), and /dad/ (81 percent), were correctly identified more than 90
percent of the time. The burst was a moderately effective cue before the front
vowels, /i/ (57 percent) and /i/ (65 percent), but otherwise carried little
weight, and was only marginally aided by addition of the devoiced transition.
The full transition (devoiced and voiced portions), on the other hand, was a
moderately effective cue (60 percent or higher) before the back and central
vowels, but a weak cue before the front vowels. There seems to be a reciprocal
relation between burst and transition: if the weight of one is high, the
weight of the other is low. Wherever the full transition carried any marked
weight, removal of its devoiced portion led to an appreciable drop in
performance, particularly before /u/ and /3^/. In general, neither burst nor
transition alone maintained performance at the level of the original
syllables.

Velar. All the original syllables except /gid/ (56 percent), /gid/ (70
percent), /gEd/ (61 percent), and /gad/ (88 percent) were correctly identified
more than 90 percent of the time. The burst elicited moderate performances only
before /i/ (41 percent) and /3^/ (69 percent), and was appreciably aided by
addition of the devoiced transition only before /u/. For the full transitions,
performance was moderate before /e/ (48 percent), /a/ (41 percent), /u/ (75
percent), and /3^/ (73 percent), but weak elsewhere. Just as for /d/, removal
of the devoiced portion of the transition had a marked effect before /u/ and
/3^/. There is again some evidence of a reciprocal relation between burst and
transition. Even more obviously than for /d/, no subset of the cues held
performance at the level of the original syllables.

Figure 3: Percent correct recognition of place of articulation for speaker 1
as a function of following vowel. The five different combinations of
cues to place of articulation are parameters of the curves.
[Experiment I; one panel per consonant series (/bVd/, /dVd/, /gVd/).
Vertical axis: percent correct. Horizontal axis: following vowel.
Curves: original syllable; burst + vowel; burst + devoiced transition +
vowel; voiced transition + vowel; voiced + devoiced transition + vowel.]

Experiment II (Speaker 2) 

Figure 4 displays the results for tokens from the second speaker in the
same format as Figure 3. For /b/ and /d/, the pattern of results is similar to
that of Experiment I, apart from a general increase in level of performance;
for /g/ the perceptual weight of the burst is clearly greater than it was for
speaker 1. It will be recalled that the duration of the burst and aspiration
segments of speaker 2's utterances was very much greater than (nearly double)
that of the corresponding segments of speaker 1 (see Table 1).

Labial. All the original syllables, except /bud/ (85 percent), were correctly
identified more than 90 percent of the time. The burst was moderately
effective as a cue before all vowels, especially the central to back vowels,
/o/ (79 percent), /u/ (79 percent), and /3^/ (85 percent), and was as effective
as the full syllable for /e/ (97 percent). The full transition served almost as
well as the full syllable for all vowels except /a/ (75 percent), /u/ (36
percent), and /3^/ (42 percent), the last two again being the vowels for which
burst performance was at its best. The perceptual effect of adding the devoiced
transition, whether to burst or voiced transition, was generally small and not
reliable.

Apical. All the original syllables were correctly identified more than 90
percent of the time. The burst was a strong cue before /i/ (100 percent) and
/3^/ (91 percent), moderate before /i/ (72 percent) and /e/ (79 percent), but
otherwise carried little weight. Addition of the devoiced transition to the
burst had no systematic effect. The full transition was almost as effective as
the full syllable for central and back vowels, but was a weak cue before the
front vowels. Removal of the devoiced portion of the transition tended to lower
performance, especially before /u/. Performances on bursts and transitions
were reciprocally related before all vowels except /ae/ and /3^/.

Velar. All the original syllables, except /gid/ (64 percent), were correctly
identified more than 90 percent of the time. The burst was a moderately
effective cue before /i/ (73 percent), /e/ (52 percent), and /a/ (55 percent),
and almost as effective as the full syllable before /i/, /a/, /o/, /u/, and
/3^/. Addition of the devoiced transition had no systematic effect. The full
transition was a moderately effective cue before /i/ (64 percent) and /e/ (46
percent), a strong cue before /a/ (91 percent), but otherwise carried little or
no perceptual weight. Removal of its devoiced portion tended to reduce
performance, particularly before /i/ and /a/. Burst and transition again tend
to be reciprocally related, particularly before central and back vowels, except
/a/.

Figure 4: Percent correct recognition of place of articulation for speaker 2
as a function of following vowel. The five different combinations
of cues to place of articulation are parameters of the curves.

DISCUSSION



Experiments I and II 

The perceptual weight carried by release bursts and formant transitions
as cues to place of articulation varied with consonant, vowel, and speaker.
No single cue, or pair of cues, was sufficient for recognition in all contexts.
If we take into account variations in acoustic structure, such as those outlined
in the introduction, we can make sense of many, though not all, of the results.

Labial. As expected, labial bursts were relatively weak cues. For
speaker 2 they were longer in duration and considerably more effective than
for speaker 1. Nonetheless, the patterns of performance are quite similar for
the two speakers; apart from an anomalous point at /e/ for speaker 2, labial
bursts tended to be most effective before rounded vowels. Whether this is due
to variations in burst energy or to variations in burst frequency position in
relation to the following vowel will become clearer when we have reported the
results of Experiment III. Here we note simply that the rank order correlation
between burst duration and percent recognition was not significant for either
speaker.

Formant transitions, on the other hand, were almost as effective for both
speakers as the full complement of cues before all nine vowels except /u/ and
/3^/. The two exceptions are rounded vowels, for which lip constriction
necessarily reduces the rise in formant frequency (i.e., the extent of formant
transitions) associated with mouth opening.

Apical. As expected, apical bursts tended to be longest and most effective,
for both speakers, before front vowels. They were weak before all other vowels
(/3^/ is an exception for speaker 2), and seem to have become systematically
weaker as rounding (and so front-cavity length) increased, reducing burst
energy (see Table 1). However, burst frequency may also be relevant, and we
again defer discussion, noting only the lack of significant correlations
between burst duration and performance.

For both speakers (particularly speaker 2), formant transitions were strong
cues before central and back vowels /A,a,o,u,3^/, where apical transitions are
extensive, but weak cues before the front vowels, where transitions are
relatively short. Furthermore, as might be predicted from the longer apical
than labial VOTs (see Table 1), addition of the devoiced to the voiced
transition segments tended to improve recognition of /d/ more than of /b/.
However, within the apical series, VOT does not significantly predict the
performance gains from addition of the voiceless transitions.

Velar. Speaker differences are most marked for the velar series. The
predicted tendency for the burst to be more effective before back, rounded
than before front, unrounded vowels was borne out for speaker 2, despite his
somewhat longer front than back vowel bursts. However, for speaker 1, the
burst was simply a very weak cue before all vowels, except /3^/. Again, we
note the lack of significant correlation between burst duration and performance,
and defer comment on these results.

As expected, the relatively short velar transitions were far less effective
cues than were labial and apical transitions for both speakers. At the
same time, longer VOTs did tend to increase the effectiveness of devoiced
transitions. For speaker 1 the largest performance gains from the addition of
devoiced to voiced transitions were for /i/, /e/, /u/, and /3^/, the vowels
before which aspiration durations were longest; similarly, for speaker 2 the
largest gains were for /i/ and /a/ (see Table 1). However, the rank order
correlation between performance and voice onset time was not significant.

Broadly, our results agree with those of Fischer-Jørgensen (1972) for
Danish initial voiced stops in these respects: (1) the burst was a relatively
effective cue for /b/ before /u/, but not before /a/; (2) the burst was a
relatively effective cue for /d/ before /i/, but not before /a/ or /u/; (3) the
burst was a relatively effective cue for /g/ before /i/ and /u/ (speaker 2
only), but not before /a/ (speaker 1 only). Our results disagree with those
of Fischer-Jørgensen insofar as: (1) the burst was a relatively ineffective
cue for /b/ before /i/; (2) the burst was a relatively ineffective cue for /g/
before /u/ (speaker 1 only); (3) the burst was a relatively effective cue for
/g/ before /a/ (speaker 2 only).

Our results do not support the implication of Cole and Scott (1974a) that
release bursts alone are sufficient cues to the place of articulation of
initial voiced stop consonants. Nor, contrary to our own expectation, did the
addition of devoiced transitions to the bursts reliably improve recognition.
If we adopt as an arbitrary (and modest) criterion of significant perceptual
weight that recognition performance for release-bursts-plus-vowels should drop
by no more than 25 percent below performance for the original syllable, we see
that this level was reached for speaker 1 on only one syllable out of 27
(/bud/), and for speaker 2 on only 13 syllables out of 27 (/bed, bod, bud,
b3^d, did, did, ded, d3^d, gid, gad, god, gud, g3^d/). The role of
consonant-vowel (CV) coarticulation in determining burst effectiveness,
implicitly denied by Cole and Scott (1974a), is suggested by the preponderance
among speaker 2's 13 syllables of central-back, rounded vowel syllables for
/b/ and /g/, and of front unrounded vowel syllables for /d/.
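
The criterion of significant perceptual weight adopted above can be written as
a one-line test. This is a minimal sketch, assuming that "25 percent" means 25
percentage points (the text does not say whether the drop is absolute or
relative); the function name is hypothetical. The example values are the
speaker 1 labial percentages quoted in the Results.

```python
def carries_weight(original_pct, burst_pct, max_drop=25.0):
    """True if burst-plus-vowel performance falls no more than max_drop
    percentage points below performance on the original syllable.

    Assumption: the drop is measured in percentage points, not as a
    proportion of the original score.
    """
    return (original_pct - burst_pct) <= max_drop


# Speaker 1, /bud/: original syllable 85 percent, burst before /u/ 81 percent.
bud_ok = carries_weight(85, 81)

# Speaker 1, burst before /3^/: 51 percent. The original syllable was
# identified "more than 90 percent of the time", so 90 serves as a lower
# bound on the original score; the true drop is at least this large.
b3d_ok = carries_weight(90, 51)
```

With these values the /bud/ burst meets the criterion and the burst before
/3^/ does not, in line with the single speaker 1 syllable reported above.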

EXPERIMENT III

The purpose of this experiment was to test the hypothesis that the initial
release burst of /bVd, dVd, gVd/ syllables may be a functionally invariant cue
to consonantal place of articulation across a representative set of
syllable-nucleus types. The method was to transpose the release burst from
each CVC syllable in a series (labial, apical, velar) across all types of VC
syllables in that series. For a fair test of the hypothesis we needed tokens
from a speaker whose release bursts were known to be at least moderately
effective cues in their original syllables. We therefore used the 27 CVC (and
9 VC) syllables recorded by speaker 2 for Experiment II.

Method 

The experimental signals were constructed in exactly the same way as the
burst-plus-vowel signals of Experiments I and II. The burst was removed from
all 27 CVC syllables (for durations see Table 1). Each burst was then attached
to all nine vowel-/d/ syllables (where the vowels were again
/i,i,e,ae,A,a,o,u,3^/), leaving a silent interval between burst offset and
vowel onset equal in duration to the devoiced interval for the CVC token being
simulated. The result was a set of 81 syllables in each series (labial,
apical, velar), for a total of 243.

Three repetitions of each syllable were recorded and randomized into a 
single test of 729 items. The test was administered to eight Lehman College 
undergraduates under conditions and instructions identical with those used for 
the Lehman College students of Experiments I and II. 
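
The stimulus counts in this design follow from crossing burst source with
target vowel within each series. The sketch below reproduces that arithmetic;
the vowel labels are ASCII stand-ins chosen only for illustration, and the
tuple layout is an assumption, not a description of how the tapes were
actually assembled.

```python
from itertools import product

SERIES = ["labial", "apical", "velar"]
# ASCII stand-ins for the nine vowels; labels only, not phonetic claims.
VOWELS = ["i", "I", "E", "ae", "A", "a", "o", "u", "3r"]
REPETITIONS = 3

# Each stimulus: (series, vowel of the burst's source syllable,
# vowel of the /Vd/ syllable the burst is attached to).
stimuli = [
    (series, source, target)
    for series in SERIES
    for source, target in product(VOWELS, VOWELS)
]

per_series = len(stimuli) // len(SERIES)                 # 9 bursts x 9 vowels
untransposed = [s for s in stimuli if s[1] == s[2]]      # burst on its own vowel
test_items = len(stimuli) * REPETITIONS                  # length of the test
```

The untransposed items are the 27 burst-plus-vowel combinations whose data
points are circled in Figure 5.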

Results 

Figure 5 displays percentage correct identification of initial consonantal
place of articulation as a function of following vowel for the nine bursts in
each series. Responses were scored for place of articulation only, and voicing
errors were disregarded. To facilitate reading, the results for bursts drawn
from syllables containing the four front vowels (/i,i,e,ae/) have been grouped
in the upper three graphs; the results for bursts drawn from syllables
containing the five central and back vowels (/A,a,o,u,3^/) have been grouped
in the middle three graphs. The following vowels have been ordered along the
horizontal axes to trace a path around the rim of the English vowel loop from
/i/ through /a/ to /u/, with /3^/ appended, and points have been connected by
straight lines to facilitate reading. For untransposed bursts (i.e., bursts
placed before the same vowel as that of the syllable from which they were
originally drawn) the data point is circled.

Before considering the three series separately, several general points can 
be made. First, the highest performance for a given vowel is often not elicited 
by the burst taken from the original syllable containing that vowel. For exam- 
ple, the burst drawn from the syllable /bad/ elicited a lower performance when 
attached to /ad/ (the circled point over /a/ in the middle labial graph of 
Figure 5), than did the bursts drawn from any of the other eight /bVd/ syl- 
lables. Similar, if less severe, discrepancies appear for many other syllables. 

Second, the highest recognition performance elicited by a particular burst 
is not always for a syllable containing the same vowel as the syllable from 
which the burst was drawn. This is most striking in the apical series for 
which the highest performances elicited by all nine bursts are before /id/ and 
/id/. Similarly, in the labial series, bursts drawn from all nine syllables, 
including the front vowel set, elicit their highest performances when attached 
to back vowel syllables; and in the velar series, bursts from the four central 
and back vowel syllables, /gad, god, gud, g3^d/, elicit roughly interchange- 
able performances within their own set. 

Figure 5: Percent correct recognition of place of articulation in burst plus
/Vd/ syllables. In the top two rows of figures, each point represents the
recognition of syllables composed of a burst taken from one vocalic
environment and transposed to each of the other vocalic environments
(with the exception of the circled points for the nine untransposed
bursts). In the bottom row, average correct recognition scores for
syllables in which the bursts were transposed are compared with
recognition scores for syllables in which the bursts were attached
to the same vowel as that of the syllable from which they were
originally taken.

Both these results suggest a measure of commutability among the bursts of
each series. This commutability becomes even more obvious as soon as we notice
a third feature of the data, closely related to the first two: the overall
form of the performance curves across the vowels is remarkably similar for all
bursts within a series, whatever the syllables from which they were drawn.
The degree of concordance among the nine curves of each series is a measure
of burst commutability or functional invariance. Furthermore, a rather good
description of the general curve for each series is provided by simply plotting
for each vowel the percentage of correct identification of its "untransposed"
burst (circled points). These curves are displayed in the lower three graphs
of Figure 5, together with a plot of the mean percentage correct for the
transposed bursts. The rank order correlation between these curves, that is,
between performances elicited by transposed and untransposed bursts, is then a
second measure of burst commutability or functional invariance.
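
The first of these measures, concordance among the curves of a series, is
reported below as Kendall's coefficient of concordance (W). As a minimal
sketch of how W is obtained, each curve is converted to ranks across the
vowels and the spread of the rank sums is compared with its maximum possible
value. The helper names are hypothetical, no correction for ties is applied,
and the curves in the usage lines are invented for illustration, not taken
from Figure 5.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(curves):
    """Kendall's W for m curves measured over the same n items.

    Uses W = 12 S / (m^2 (n^3 - n)), where S is the sum of squared
    deviations of the item rank sums from their mean. No tie correction.
    """
    m = len(curves)
    n = len(curves[0])
    ranked = [average_ranks(curve) for curve in curves]
    rank_sums = [sum(r[j] for r in ranked) for j in range(n)]
    mean_sum = m * (n + 1) / 2.0
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))


# Invented curves: identical rank order gives W = 1, opposed orders give W = 0.
w_parallel = kendalls_w([[10, 20, 30], [1, 2, 3], [5, 6, 7]])
w_opposed = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

W depends only on the rank order of performance across vowels, so curves that
differ in overall level but rise and fall together still yield a high value.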

Labial. All nine labial curves are roughly parallel: bursts from almost
every syllable elicit their highest performance before the central-back rounded
vowels, /a,o,u,3^/, a moderate performance before /e/ (before /ae/ for the /bud/
and /b3^d/ bursts), and relatively weak performances before /i,i,ae,A/ (except
the peak for the /bid/ burst before /i/). Kendall's coefficient of concordance
(W) among the nine curves is .79 (p < .0001). This significant similarity in
pattern of burst effectiveness (or sufficiency) demonstrates that the nine
bursts are, to a large degree, functionally invariant. However, Spearman's rho
between untransposed and mean transposed curves of the bottom labial graph falls
short of significance with a value of .53. This failure is clearly due to the
peaks for the untransposed /bid/ and /bed/ bursts and suggests that release
bursts effective in signaling labiality before /i,E/ may be context-dependent.
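
The second measure, Spearman's rho between the untransposed and mean
transposed curves, is the ordinary product-moment correlation computed on
ranks. The following is a minimal sketch with hypothetical function names;
the two short lists in the usage lines are invented and are not the values
plotted in Figure 5.

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    ordered = sorted(values)
    # ordered.index(v) is the 0-based first position of v; with c copies of v,
    # the mean 1-based position is first + (c + 1) / 2.
    return [ordered.index(v) + (ordered.count(v) + 1) / 2.0 for v in values]


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented percent-correct curves over five vowels (not data from the study).
untransposed_curve = [10, 20, 30, 40, 50]
transposed_curve = [22, 15, 48, 31, 60]
rho = spearman_rho(untransposed_curve, transposed_curve)
```

With only nine vowels per curve, a single discrepant peak can shift several
ranks at once, which is consistent with the authors' remark that the labial
value is pulled down by the untransposed /bid/ and /bed/ peaks.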

Apical. All nine apical curves are roughly parallel; bursts from every
syllable elicit their highest performances before /i/ and /i/, and apart from
fair performances for the /did/ and /d3^d/ bursts before /e/, and for the /d3^d/
burst before the back vowels and /3^/, relatively weak performances elsewhere.
Kendall's W among the nine curves is .72 (p < .0001). Spearman's rho between
the untransposed and the mean transposed curves of the bottom graph (Figure 5)
is .60 (p = .05), clearly pulled down by the peak for the untransposed /d3^d/
burst. The apical bursts, like the labial bursts, are to a large degree
functionally invariant.

Velar. The curves for the velar bursts fall into two distinct groups:
front and central-back vowels. The front vowel bursts elicit moderate
performances before /i,i,e/ and, apart from a small peak for the /ged/ burst
before /a/, weak performances elsewhere. The central-back vowel bursts elicit
their highest performances before /a,o,u,3^/, weak performances elsewhere,
though with a tendency for slightly stronger performances before /i,i/. There
is thus a small asymmetry: while front vowel bursts do not concord with back
vowel bursts before back vowels, back vowel bursts tend to concord with front
vowel bursts before front vowels. As a result, Kendall's W among the nine
curves, though significant (p < .001), is low (.37). However, if we separate
the two groups and compute Kendall's W within them, we find for the front
vowels .69 (p < .05), and for the central-back vowels .66 (p = .01). The
increased coefficients justify separating the bursts into two groups.
Accordingly, the transposed burst curve of the bottom graph (Figure 5) was
computed for front vowels and for central-back vowels separately. The result
is an excellent fit between transposed and untransposed curves, for which
Spearman's rho is .88 (p < .01). There is therefore a large degree of
functional invariance among the velar front vowel bursts and among the velar
central-back vowel bursts.

Discussion 

While the release bursts of initial labial, apical, and velar stops display
a high degree of functional invariance, they do not display a corollary
degree of sufficiency. In all three experiments, the release burst was seldom
sufficient to maintain performance at the level elicited by the original
syllable. Vowel-dependent variations in performance are therefore less aptly
characterized as variations in "sufficiency" or "cue adequacy" than as
variations in the degree to which the burst may be assumed to contribute to
the cue complex in natural speech (cf. Stevens, 1975).

In Experiment III, identification of the original syllables was perfect 
except for /gid/ which the listeners identified with 87 percent accuracy. If 
we again adopt as an arbitrary criterion of significant perceptual weight that 
the performance on the untransposed burst-plus-vowel should drop by no more 
than 25 percent below performance on the original syllables, we arrive at the 
following set of 14 out of 27 syllables for which the release burst carried 
weight in judgments of place of articulation in either or both of Experiments II 
and III: /bid, bed, bod, bud, b3^d, did, did, ded, d3^d, gid, gad, god, gud, 
g3^d/. 

These results bring us into closer agreement with both Fischer-Jørgensen
(1972) and Cole and Scott (1974a), since the untransposed burst carried
significant weight for /b/ before /i/ in Experiment III. The results also agree
very well with those of Liberman, Delattre, and Cooper (1952). These authors
used the relatively crude Pattern Playback II synthesizer to construct
schematic stop bursts before seven two-formant monotone vowels,
/i,e,e,a,o,o,u/. Identifications reached 75 percent or higher for /p/ before
/i,e,e,o,o,u/, for /t/ before /i,e,e/, for /k/ before /a,o,o,u/. Considering
only the vowels common to both experiments, these results agree with our own
in finding bursts to carry weight as labial cues before /i,e,o,u/, as apical
cues before /i,e/, as velar cues before /a,o,u/. The only discrepancy between
the two sets of results is in our finding that a release burst carried weight
as a velar cue before /i/. This remarkable agreement between the present
natural speech study and an experiment carried out with primitive synthetic
speech 25 years ago suggests that the systematic variations in burst
effectiveness common to both experiments reflect a robust perceptual process.

The most obvious source of these variations might seem to lie in release
burst energy. Unfortunately, we were not able to make reliable intensity
measurements of the release bursts in the present study. However, a scan of
the syllables for which release bursts proved adequate and of their durations
in Table 1 will reveal no obvious correlation and, as reported above,
Spearman's rho between burst duration and performance was not significant for
any series. Furthermore, since all schematic bursts synthesized by Liberman,
Delattre, and Cooper (1952) were of equal energy, this factor cannot account
for their results. Thus, while variations in burst energy may well account for
variations in the overall performances elicited by particular bursts or in the
recognition of different tokens of a particular stop-vowel syllable (and so
for the different levels of performance elicited by the bursts of speakers 1
and 2), they cannot account for systematic variations in burst effectiveness
across vowels.

The case is no better when we turn to the absolute spectral properties of
release bursts. For example, as remarked in the introduction, spectral sections
taken through the apical release burst show a broad high intensity curve over
frequencies above about 2000 Hz, largely independent of the following vowel
(see Figure 2). We can hardly, therefore, appeal to the absolute spectral
properties of the apical burst to explain the fact that the burst carries
appreciable weight before high front vowels such as /i,i/, but essentially no
weight before central-back vowels such as /A,a,o,u/.

In fact, the key to the problem may be provided by the work of Kuhn (1975).
First, he draws on the acoustic theory of speech production, according to which
the resonance of the cavity in front of the point of maximum tongue
constriction (that is, "the front cavity resonance") may be associated with
any of the first four formants (Fant, 1960:72). He then shows that "the front
cavity seems to be associated with what is perhaps the most intense group of
formants: with the F3 group for /i,i,e,ae/, and with the F2 group for
/a,A,u,u/" (Kuhn, 1975:430). Next, he demonstrates that a front cavity
frequency estimate can be most readily made for the more constricted vowels
and for highly constricted consonants, and that for stop consonants the
estimate may be derived from the spectral structure of bursts and transitions.
Since the front cavity resonance is a function of front cavity length, and
since front cavity length is a function of the place of articulation, an
estimate of the resonance is tantamount to an estimate of place of
articulation. Finally, a variety of evidence from synthetic speech experiments
(e.g., Liberman et al., 1952) suggests that place of articulation is most
readily conveyed by stop consonant bursts when their spectral weight lies
close to the front cavity resonance of the following vowel. Proximity on the
frequency scale may facilitate perceptual integration of the burst with the
vowel, so that the listener can track the changing cavity shape characteristic
of a particular place of articulation followed by a particular vowel. This
hypothesis can account for many of the variations in burst effectiveness
observed in Experiments II and III.

Labial. The low frequency labial bursts carried significant weight (by
the criterion defined above) before /o,u,3^/ in both experiments, and close to
significant weight before /a/ in Experiment II, and before /a/ in Experiment
III. For all these vowels the front cavity is strongly associated with the
second formant and the frequency of that formant lies below 1000 Hz, a region
over which the greatest weight of labial burst energy is distributed. The
variability in response for /a,a/ may be due to weaker front cavity-to-formant
affiliation in less constricted vowels, and the consequent difficulty for the
listener in continuous tracking of the changing front cavity resonance in the
absence of a formant transition.

The two other vowels before which labial bursts carried significant weight
were /i,e/, for which the front cavity is strongly associated with the third
formant. However, the untransposed bursts were notably more effective than
the transposed and, as remarked above, this suggests a degree of context
dependency. The rapid and relatively extensive lip opening before unrounded
vowels, and the consequent rapid rise in resonant frequencies, may extend the
burst frequency range sufficiently high for it to be integrated with F3 of
the following vowel. The ineffectiveness of the burst before /ae/ may again
be due to weaker front cavity-to-formant affiliation in a less constricted
vowel, and the resulting difficulty for the listener. However, the
ineffectiveness of the burst before /i/ is unexplained.

Apical. Apical bursts carried significant weight before /i/ in both
experiments and before /i,e,3^/ in Experiment II, although performance was very
weak for most bursts before /e,3^/ in Experiment III. On the assumption that
the high frequency apical burst can be integrated perceptually with the front
cavity resonance of F3 for the high front vowels /i,i/, but less readily, if
at all, with the less determinate front cavity resonance of the more open
vowels /e,ae/, or with the low frequency front cavity resonance of F2 for the
central-back vowels /A,a,o,u/, these results are very much what we would
expect. Nonetheless, there are oddities. For example, it is not clear why the
/did/ burst (duration 15 msec) should have been more effective before /i/ than
was the untransposed burst from /did/ (duration 25 msec) in Experiment III. Nor
is it clear, given the moderate duration of the /d3^d/ burst (10 msec), why
it should have been a strong cue (91 percent) before the low front cavity
resonance of F2 for /3^/ in Experiment II and a moderately strong cue before
/a,o,3^/ in Experiment III.

Velar. Velar bursts carried significant weight before /i,a,o,u,3^/ in
both experiments. It will be recalled that the spectral weight of velar bursts
tends to lie close to the F2 frequency of the following vowel. Perceptual
integration of the burst with the front cavity resonance of F3 for the front
vowels should therefore be easiest when F2 and F3 lie close together as in /i/,
precisely as observed. For the central-back vowels /a,o,u,3^/, variation in
F2 frequency, and so of velar burst frequency, is small (roughly from 600 to
1000 Hz). We might therefore expect that velar bursts from all four vowels
would be readily commutable and accessible to perceptual integration with the
front cavity resonance of F2. Again, this is precisely what was observed.
The systematic decline in performance as F2 (and so velar burst frequency)
decreases from /i/ to /ae/ (see Figures 4 and 5) suggests that the
ineffectiveness of velar bursts before /i,E,ae/ may be due to the increasing
separation of burst and front cavity resonance (F3) on the frequency scale. The
inadequacy of the burst before /a/ may arise from the relatively weak front
cavity-to-formant affiliation for this vowel.

In short, despite several unexplained oddities in the data, our perceptual
integration hypothesis provides a remarkably close account of the variations
in burst effectiveness in Experiments II and III. At the same time, this account
affords insight into the grounds of functional invariance among stop release
bursts. Bursts are invariant insofar as they all bear the same relation to any
particular following vowel. The relation is that of spectral continuity or
discontinuity with the main (or front cavity) resonance of the following vowel.
If there is continuity (as in an apical burst followed by /i/, for example),
the relation contributes significantly to recognition of consonantal place of
articulation; if there is discontinuity (as in an apical burst followed by /a/,
for example), the relation does not contribute significantly to recognition.
The invariance is therefore not a simple first-order invariance based on the
absolute frequency and/or amplitude of the bursts. Rather, it is a higher
order relational invariance based on spectral relations between burst and
following vowel.

The general conclusion that the contribution of the burst to the cues for 
place of articulation depends on the following vowel, is not new. Liberraan, 
Delattre, and Cooper (1952) remarked of their schematic /p/ and /k/ bursts be- 
fore schematic vowels that: "...the irreducible acoustic stimulus is the sound 
pattern corresponding to the consonant-vowel syllable" (p. 516). While neither 
Fant (1959, 1969) nor Stevens (1975) believes that the perceptual process al- 
ways requires reference to the vowel, both describe the burst in natural speech 

23 



30 

EKLC 



as dependent on context for its effect. Stevens (1975) deliberately eschews 
a description in terms of release bursts and individual formants, since this 
would imply that these components have independent roles in the cue complex. 
He emphasizes rather "the overall acoustic spectrum immediately following the
release" (p. 311). However, regarding the contribution of the burst to this 
spectrum he writes: 

"We shall assume that this can be considered as the initiation of the 
rapid spectrum change at the consonant release, if there is spectral 
energy in the burst in the vicinity of the major spectral peak for the 
vowel.... Thus the initial burst of energy in syllables beginning
with /g/, and the burst for syllables with a front vowel preceded by
/d/ would be considered as part of the rapid spectrum change, since
major energy concentrations in these bursts occur in frequency regions
where the vowel formant transitions are providing cues for place of
articulation of the consonant. The d-burst in a syllable with a back
vowel, on the other hand, would not be considered as an integral part 
of the rapid spectrum change.... The burst at the onset of the conso- 
nant /b/ is relatively weak, and may not play a significant role in 
shaping the rapid spectrum change." (Stevens, 1975:312-313). 

The present study suggests that, at least for some speakers and listeners, 
the contribution of the /g/ burst may not be as strong for open vowels as closed 
vowels, and that the contribution of the /b/ burst may not always be insig- 
nificant. It is precisely to an understanding of such detailed variations that 
Kuhn (1975) has added by identifying both the burst and "the major spectral 
peak for the vowel" with the front cavity resonance. In short, Stevens' gen- 
eral description of the conditions under which the burst contributes to the 
spectral changes following release is consistent both with Kuhn's (1975) front 
cavity analysis and with our own results. In the following discussion, we 
attempt to develop some implications of these results for the perceptual
process.

GENERAL DISCUSSION 

An important feature of the results of Experiments I and II was the ten- 
dency toward reciprocal performances on bursts and transitions; where the per- 
ceptual weight of one increased, the weight of the other declined. These re- 
ciprocal relations follow systematically from the acoustic structure of the 
syllable. Where transitions are brief (for /b/ before rounded vowels, for /d/ 
before high front vowels, for /g/ before close vowels), the burst lies near 
the main formant of the following vowel and contributes significantly to the 
perceptual outcome; where transitions are extensive (for /b/ before middle, 
unrounded vowels, for /d/ before central-back vowels), the burst is distinct 
from the main formant of the following vowel, and contributes little. If we 
combine this observation with the conclusions of Experiment III, we are led 
to recognize that, wherever bursts and transitions contribute significantly to 
the perceptual outcome, they are acoustically and functionally (that is, per- 
ceptually) equivalent; both provide a spectrally continuous change from the 
consonantal release into the following vowel by which the listener can estimate 
place of articulation. To say that they are equivalent is not, of course, to 
say that they are alternative. In natural speech, as we have already emphasized, 
it must be rare that a listener relies on burst alone or on transition alone.
And in Experiments I and II, a single cue was not often sufficient to hold
recognition at the level of the original syllable. Bursts and transitions 
are equivalent and complementary. 

Once again, this observation is not new. Over twenty years ago Cooper 
et al. (1952) remarked that "bursts and transitions complement each other in
the sense that when one cue is weak, the other is usually strong" (p. 603).
In a similar vein, Fischer-Jørgensen (1954) commented on synthetic speech
studies: "The listener does not compare explosion with explosion and transi- 
tion with transition, but compares artificial syllables comprising either ex- 
plosion or transition with natural syllables that always contain both" (p. 56). 
Finally, Fant (1959; 1960:217) has repeatedly emphasized that the qualitatively 
distinct acoustic segments during the first 10-30 msec after release are prob- 
ably not auditorily discriminable and "should be regarded as a single stimulus 
rather than as a set of independent cues" (Fant, 1969:21). And, as we saw 
above, the acoustic and functional inseparability of burst and transition is 
implicit in "the rapid spectrum changes" following release that Stevens (1975: 
311) describes. In short, the opposition between invariant burst cues and 
variable transitional cues, imagined by Cole and Scott (1974a, 1974b), is false. 
Far from being opposed, bursts and transitions are functionally identical. 

In conclusion, the results of the present study, and, in particular, the 
apparent functional equivalence of release bursts and transitions, suggest that 
the perceptual process may entail continuous tracking of vocal tract resonances. 
The importance of transitional information for the recognition not only of stop 
consonants in many contexts, but also of /w,r,l,y/, nasal consonants, fricatives,
and perhaps even vowels (Lindblom and Studdert-Kennedy, 1967; Shankweiler,
Strange and Verbrugge, in press) is attested by an extensive literature (for
review, see Liberman et al., 1967; Stevens and House, 1972; Studdert-Kennedy,
1974, 1976; Darwin, in press). We do not doubt that the acoustic invariants 
for these phonetic segments may eventually be specified; however, we see little 
grounds for expecting that they will be specified without reference to context.

Modes of Perceiving: Abstracts, Comments, and Notes*
M. T. Turvey and Sandra Sears Prindle 



INTRODUCTION 

Intuitively, deliberations on modes of perceiving are intended to flesh 
out something of the special manner in which humans apprehend their world. In 
principle, the importance of the enterprise lies in the fact that even an ele- 
mentary cataloging of modes would significantly fetter the construction of the- 
ories of perception and cognition. It goes without saying that in evolving 
the perceptual styles of humans and animals, nature did not build "general-purpose
machines," but rather "special-purpose machines"; and whatever plasticity
humans and animals manifest, it is a "special-purpose plasticity." Nevertheless,
one has the impression that often theory-making proceeds untrammeled
by a serious consideration of natural constraints and seems to be oriented to- 
ward a general-purpose, context-free perceiver. 

While it is the case that deliberating on modes of perceiving is well mo- 
tivated, unfortunately it is not immediately obvious what it is that one is 
deliberating. The concept of "mode" is an intuitive object; tacitly we can 
appreciate the catalytic value of the concept in thinking about matters of perceiving
and knowing, but we cannot say precisely and unequivocally what a mode
is. Partly in response to this equivocality, our approach to summarizing the 
volume takes the following form. First, we precis the various papers, conveying,
ideally, the larger point made by each author. Second, we seek fundamental
themes which weave these larger points together in the hope that these
themes will identify major constraints on the theory of perception. Third,
and separately, we gather some of our elementary and rough thoughts on the abstract
notion of "style." These we present as notes toward a tenable characterization
of the concept of mode in psychology, and in this respect our remarks
may be regarded as complementing those of Pick and Saltzman in the initial chap- 
ter of Psychological Modes of Perceiving and Processing of Information. 

ABSTRACTS AND COMMENTS 

A contrast that comes rapidly to mind when one thinks of modes of percep- 
tual processing is that of unconscious and conscious, or as Posner and his col- 
leagues describe it, the contrast of automatic and attentive. Processing of the 
former kind, we are told by Posner et al., is very much a parallel affair while 
that of the latter kind is considerably more serial. The significant consequence 
of attentive processing is that it consumes a portion of the limited resource 
capacity, thereby curtailing the processing of other concurrent signals, and 
further, that it induces inertia in the processing apparatus. When there is the 
intentional selection (attentive) of a particular psychological channel or path- 
way, it takes effort and time to shift attention to another channel when needed. 
The costs, therefore, of attentive processing are manifestly plain; among its 
benefits, we may suppose, is a finer grain of analysis.

Inasmuch as the mode of attentive processing can be set by instruction we 
may ask: To what, precisely, is my processing directed when I am instructed to 
attend to a given location? It is this question which guides the series of 
ingenious experiments reported by Posner, Nissen, and Ogden. The conclusion
is curious and provocative. Apparently there is little benefit to be gained 
by knowing ahead of time the external location at which a signal will occur if I 
do not know the modality which will convey the signal. By inference, attentive 
processing cannot be directed to a location with the same efficacy that it can 
be directed to a modality; preference is for knowing the messenger rather than 
knowing from where the message is coming. 

With respect to the inertia induced by the mode of attentive processing, 
Posner and colleagues (Posner, Nissen, and Klein, 1976) have recently interpreted
the peculiar phenomenon of visual capture as being indicative of an inertial
asymmetry between switching from vision to another modality, and switching
from another modality to vision. One is reminded that visual capture refers
to the dominating role that vision has in the human conscious experience. When 
the information for vision and another modality are in conflict, vision is the 
likely victor. Thus, I will experience my hand as tracing out a curved line 
when in fact it traces a straight line that has been prismatically distorted for 
visual consumption (Gibson and Radner, 1937). The relation between visual dominance
and the inertial aspects of attentive processing is thus expressed: experiment
suggests that vision is not an especially efficient alerting system
because the time to switch into vision from another modality significantly ex- 
ceeds the time to switch between two nonvisual modalities. If the human animal 
was not in the visual modality at the time of occurrence of an ecologically sig- 
nificant optical signal, it would be, on this account, at a distinct disadvan- 
tage. Consequently, one hypothesizes that in response to evolutionary expediency, 
nature saw fit to bias human conscious experience toward the visual pickup of
information. That the bias is software rather than hardware is suggested by
the following observation made when prismatic distortion of vision accompanies 
haptic exploration: if vision is attended to, the haptic system undergoes an 
adaptive shift, but if the haptic system is attended to, it is vision that is 
recalibrated (Kelso, Cook, Olsen, and Epstein, 1976). With other things being 
equal, it is vision that is attended to by choice. 

Herein lies a rationalization of the "primacy of vision" which dovetails 
with Lee's deliberations, for these also sought to express the supremacy of
visual perception. We shall see that, while the account of visual primacy derived
from Posner's work emphasizes the costs of vision, Lee's account
emphasizes the benefits of vision.

The term modality enjoys considerable usage. It is a term befitting the 
convention of classifying senses according to the qualitatively different con- 
scious experiences. Following this convention, the special sense of vision is 
a source of visual sensation and the special sense of proprioception is a source 
of sensation of one's own movements. It has been remarked by Gibson (1966), 
and echoed enthusiastically by Lee, that it is far more sensible to classify the 
senses in terms of activities such as looking and listening than in terms of 
passive conduits transporting qualitatively different sense data. When approached
from this perspective, the term "perceptual system" is substituted for the term 
"senses." And whereas the fundamental role assigned to the senses is that of 
providing raw materials for the creation of conscious experience, the fundamental 
role assigned to perceptual systems is that of obtaining information in the ser- 
vice of activity, as Lee so elegantly puts it. 

A promissory note of Gibson's (1966) approach is that different perceptual 
systems can be sensitive to the same information. Here information is defined 
as information about the environment in a sense of specificity to it; and it 
is this sense of the term that is intended by Pick and Saltzman. The claim is 
that the pickup of information of a given type is not necessarily the preroga- 
tive of any one perceptual system. It is a claim that is easily glossed over 
by students of perception but its ramifications are considerable (White, 
Saunders, Scadden, Bach-y-Rita, and Collins, 1970); for those who think in terms
of special senses — or special modalities — it is anathema. 

Lee reminds us that in the regulation and control of activity three kinds 
of information are needed: information about surface layout and events; infor- 
mation about relations and changing relations among the limbs; and information 
about the motion of the body relative to the environment. His argument is that 
vision supplies all three — it is trimodal — and does so better than the other 
perceptual systems. Hence, we have the "primacy of vision." Essentially, 
vision's relation to the other perceptual systems is that of overseer: vision 
tunes and calibrates those systems which would otherwise be imprecise sources 
of information relative to the guidance of activity. A dramatic demonstration 
of vision's role with respect to body-related (proprioceptive) information is 
provided by Gross, Webb, and Melzack (1974). When asked to plot the position 
of an arm which rested without moving and out of view (it was hidden by an 
opaque shield), participants could do so quite accurately if the delay from 
last seeing the arm was relatively short. However, with the passage of time the 
position of the resting arm was felt to migrate to one of two positions: flection-adduction
or extension-abduction.

It is but a small leap from Lee's paper to Mack's. For the contentions
between traditional and Gibsonian perspectives (between indirect and direct
realism) that were merely interlineal in Lee's paper are brought to focus in
Mack's paper. A departure point is that thorny issue on which Boring and Gibson 
collided: Can the visual world be apprehended independently of the visual 
field? Its cognate is perhaps better known: Can perception be indifferent to 
sensation? This issue takes many forms that are by no means identical. The 
gist, however, is unmistakable: it is a matter of whether the world can be per- 
ceived first hand — directly, or only second hand (by virtue of some surrogate) — 
indirectly. 

Mack distinguishes between proximal and constancy perception. Put bluntly, 
proximal perception is determined solely by the absolute properties of the ret- 
inal image; in contrast, constancy perception is determined by these image prop- 
erties only partially, or not at all. Obviously the central concept is that of 
the retinal image and we may, after Gibson (1950), identify two versions of the 
concept for they are of significance to Mack's remarks. In one version, the 
image is defined as the anatomical pattern of cells that are excited — this we 
call the anatomical image; in the other version, the image is defined as the 
ordinal pattern of excitations indifferent to the location of cells excited — 
this we call the ordinal image. It was Gibson's (1950) intuition that seeing 
in terms of the anatomical image and seeing in terms of the ordinal image were 
two different ways of seeing, two different modes, if you wish. 

Generally, when one talks about the retinal image it is the anatomical im- 
age one has in mind. Related to this conception is a tendency to talk about the 
light at an eye in terms of Euclidean geometry and thus to emphasize absolute 
metrical values. Euclidean geometry was all that was known to the ancients and 
to the intellectual ancestry who established the conventions and fundamental 
assumptions of contemporary visual theory. In contrast, the conception of the 
ordinal image encourages the adoption of projective geometry and its emphasis 
on abstract relations preserved over projective transformations. 

When one describes the retinal image or proximal stimulus in Euclidean 
terms there is an apparent lack of correspondence between the image and its distal
referent. Consequently, insofar as perception tends to be veridical, it
follows that the light at an eye underdetermines perceptual experience. The
appropriate perception arises by virtue of processes which supplement the retinal 
image. Most generally these processes are thought of as memorial or problem 
solving in nature. The observer in this, the traditional point of view, is 
much like Sherlock Holmes who must attempt to determine what actually transpired
from the limited data or available clues. We refer to this point of view
as constructivism in order to emphasize the central hypothesis that visual
perception is built out of a number of ingredients — some of which are provided 
by the retinal image and some of which are provided by other extra-visual 
sources (Turvey, 1974, 1975).

Let us now return to Mack's three modes of perception. By all accounts 
the proximal mode is evident only when the conditions of observation are 
highly constrained; for example, a two-dimensional nonchanging display exposed 
briefly against a homogeneous background and viewed from a stationary point of 
observation. In a phrase, the mode of proximal perception is precipitated by 
impoverished stimulation.

The subject-relative constancy mode is most obviously an example of constructivism,
for the ingredients in the perception recipe include absolute and
local anatomical image properties and nonvisual information about eye, head, and 
body orientation. In subject-relative constancy one must go beyond the light 
to an eye in order to determine perceptual experience. By our interpretation 
subject-relative perception uses the anatomical image. And what we would like
to believe is that only in rare and artificial circumstances does the anatomical
image play a determining role in experience. In short, operating in the proximal
and in the subject-relative constancy mode are unnatural recourses for the
visual perceptual system. 

We are led therefore to the point of view that object-relative constancy 
perception is representative of the style in which the visual perceptual system 
maintains contact with the environment. In the object-relative mode, abstract 
relations in the structured light at an eye provide the optical support for 
visual perception without supplementing by nonvisual data. Mack informs us 
that visual perception in the laboratory may sometimes be in error because of 
the curious bias of the visual system to operate in the object-relative mode 
when the subject-relative mode is more felicitous for the conditions of observa- 
tion. But we should not be surprised by this fact. If it is the case that the 
optical flow pattern at a moving point of observation is structured adjacently 
and successively in ways that are specific to the observer's movement and to the
properties of the environment as Gibson (1966) and Lee argue, then we should
suppose that evolution optimized the visual perceptual system of humans and beasts
to be sensitive to this structure. It is the abstract relational information
in the ordinal image understood as the ambient optic array and not the metrical
character of the anatomical image, which has constrained the evolution of visual 
systems. And if that invariant information is specific to the environment, then 
as the optical support for visual perception it merely has to be detected; it 
would not have to be supplemented by other sources of knowledge.

Let us summarize to this point. Our quest for the natural style in which 
humans perceive has realized two dividends. One is that — ceteris paribus — vision 
preempts conscious experience because it is the most abundant supplier of information
about the environment and about one's self; as far as perceptual systems
go, it is potentially more costly not to be visually attentive, and considerably 
more laborious to become so. The other dividend is that, although visual per- 
ception may operate in a subject-relative or constructivelike mode, this is 
not its more natural and preferred style. We pursue the latter proposition in 
the paper of Shaw and Pittenger. 

As remarked earlier, theorizing on matters of visual perception has tended 
to begin with the retinal image understood as an anatomical arrangement. We 
can further comment that theorizing has tended to begin with the understanding
of the retinal image as a static bidimensional form. The consequence of this 
attitude is two-fold; first, the analysis of pattern or form perception is taken 
as propaedeutic to the theory of visual perception; and second, that change, de- 
fined as the transformation of an object over time, is said to be inferred from 
a succession of static retinal images. 

The conceptualization of the optical support for visual perception as static 
and bidimensional has a long tradition. We owe to the 10th century Arab scholar 
Al Hasan the first comprehensive exegesis of the relation between the image
on the retina and visual perception. Through Berkeley and von Helmholtz the
tradition has been popularly maintained, and it is the source of the fundamental,
though rarely commented on, suppositions of contemporary visual information
processing theory and research (see Neisser, 1967; Haber, 1971). Obviously,
if we assume that a two-dimensional static description of the world is the
starting point of visual experience, then we have identified the task of perceptual
theory: to explain the means by which we arrive at static three-dimensional
descriptions (depth perception, object perception) and dynamic three-dimensional
descriptions (events). As we have already anticipated, the traditional
explanation is that such perceptual experiences are constructed with the
assistance of memory.

Suppose, however, that our intuitions about perception are guided not by 
history and the retinal image, but by the concepts of evolution and ecology. 
Such being the case, we would recognize that locomotion and the continuous orient- 
ing of the perceptual apparatus to the environment are the sine qua non of suc- 
cessful adaptation. We would recognize, in short, that dynamically transform- 
ing optic arrays would be the norm and that static frozen optic arrays would 
be the exception. Furthermore, we would appreciate that an animal would wish 
to know not simply what kind of object it was looking at but what kind of change 
the object was undergoing. Perception of the forms of change is of paramount 
importance to adaptation. In sum, from an evolutionary/ecological perspective
we might be led to conjecture that the proper point of departure for a theory 
of visual perception is kinetic events, and not two-dimensional static forms 
(Gibson, 1966; Johansson, 1974). This conclusion is cognate with the one that
we reached in our discussion of Mack's paper. 

An event, Shaw and Pittenger inform us elsewhere (Pittenger and Shaw, 1975), 
is composed of two things: the object or complex of objects undergoing the 
change and the change itself. The optical support for the perception of the 
former (the object) is referred to as the structural invariant, and the optical 
support for the perception of the latter (the change) is referred to as the 
transformational invariant. This understanding of the structure of events fol- 
lows from Gibson's working hypothesis of ecological optics, namely, that for 
any isolable environmental property there is a corresponding isolable property 
in the transforming optic array, however complex. By arguing that there are
higher-order invariants specific to the styles of change, Shaw and Pittenger
express the unorthodox view that the perception of change is direct. They ar- 
gue, in paraphrase of Gibson's notorious aphorism, that the perception of change 
is not based on the perception of static forms but, rather, on the detection 
of formless invariants over time. 

Recent examinations of comparatively simple events such as an object mov- 
ing at constant velocity or accelerating from one position to another, reveal 
that the perceptions of velocity and acceleration are not based on the prior 
discriminations of spatial and temporal extent (cf. Lappin, Bell, Harm, and
Kottas, 1975; Rosenbaum, 1975). Explanations of perceived velocity and accel- 
eration in the constructivist mode would necessitate epistemic mediation, for 
example, having discriminated at least two spatial positions—taking two retinal 
snapshots—and having monitored the time elapsed between the two positions, 
then velocity could be computed by means of a simple formula. The evidence, 
however, favors the view that velocity and acceleration are not constructed
percepts but directly perceivable attributes of stimulation. This conclusion
reflects the larger point that Shaw and Pittenger wish to make, namely, that 
the nominalistic attitude toward accounts of perceptual experience is funda- 
mentally in error. We can phrase this differently and positively; what Shaw 
and Pittenger wish to emphasize is the primacy of the abstract.
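
As a rough illustration of the constructivist account described above, the "simple formula" for velocity can be written out. This is only a sketch: the symbols x_1 and x_2 (the two discriminated positions) and t_1 and t_2 (the times of the two retinal snapshots) are labels introduced here for illustration, not notation taken from the studies cited, and the acceleration estimate is an assumed extension that would require a third snapshot at (x_3, t_3).

```latex
\[
  v_{12} \approx \frac{x_2 - x_1}{t_2 - t_1},
  \qquad
  v_{23} \approx \frac{x_3 - x_2}{t_3 - t_2},
  \qquad
  a \approx \frac{v_{23} - v_{12}}{\tfrac{1}{2}\,(t_3 - t_1)}
\]
```

On such an account velocity and acceleration are derived quantities, computed only after positions and elapsed times have been separately discriminated; it is this ordering that the evidence cited above calls into question.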

If this thesis is not already foreign enough for most students of perception
to appreciate, it is made all the more so when one considers that in our
lifetimes events range from the order of milliseconds to the order of years. 
How is it possible, we ask, to apprehend slow events without the mediation of 
memories? What can it possibly mean to detect the transformational invariant
of a slow event such as, say, aging? Shaw and Pittenger indicate the direction 
we might take in search of an answer. More tangibly, they lay bare the absurdity 
of the conventional story of memory mediation. For if my apprehension of a slow 
event comes from memory, then I must have some way of collecting the relevant 
memories, and this implies that I have knowledge of the transformation that 
relates them to each other. But the transformation that relates the memories to each 
other is what I have to infer; it cannot be presupposed. Even if we permit a 
fortuitous gathering of the relevant memories, the memory mediation story fails
to work; for now we must attribute to the inferential processes a priori knowledge
of transformations in order that we might infer from the nominal data
which event transpired. 

In the preceding paragraphs we have developed the intuitive notion that 
visual perceptual theory should be anchored in event perception, that is, in 
the perception of the transforming optic array. Obviously, within such a frame- 
work a static two-dimensional arrangement must be regarded as a type of "frozen" 
event in which the structured light at an eye has been reduced in its efficiency 
as a specifier of environmental facts. Belaboring the point somewhat, we may 
claim that truly static perception is artifactual, arising at a relatively late
phase in evolution. The perception of paintings, photographs, and the like
exemplifies the limiting case, and it is just this kind of perception that is examined
by Hagen. Her questions are straightforward and they follow naturally
from the preceding remarks: Is perceiving pictures much the same as perceiving 
the ordinary environment, or is there something special going on with pictures? 
Is there either something special about the information pictures contain or 
something special that we do with that information? As we might anticipate, 
Gibson's intuitions on these matters are essentially that the perception of pic- 
tures and the perception of the scenes they depict do not differ qualitatively, 
for the essence of pictures is that the information they convey is structurally 
equivalent to that of the scenes they depict. In a phrase, picture perception, 
like event perception, is not epistemically mediated. 

Experimentation with a wide range of conditions reveals that when pictures 
(slides, photographic prints, line drawings) are from the right station point 
and apparently equate static monocular surface-layout information, the percep- 
tion of the real scene is always superior to that of the facsimile. This could 
be because of the perceptual advantages in moving the eye over a real scene 
rather than over a picture. Alternatively, as Hagen suggests, it could be be- 
cause, when faced with the task of appreciating the three-dimensional structure 
specified by the pictorial information, one must suppress the concurrent information
specifying that the "frozen" event is actually two dimensional. In either
case, the Gibsonian thesis (Gibson, 1971) that picture perception can be direct like
ordinary perception (that is, not epistemically mediated) is not appreciably harmed.

A different conclusion, however, is implied by the "Pirenne paradox." An
observer's appreciation of the three-dimensional scene depicted by a two-dimen- 
sional picture is significantly enhanced when he or she adopts the wrong station 
point. This is paradoxical inasmuch as the perspective information provided 
by a picture is only equivalent to that provided by a real scene at the center 
of projection for the picture. Pirenne's interpretation of this paradox is 
clearly in the constructivist mode. Looking at a picture off-center enhances 
one's awareness of flatness and induces one to use knowledge about the internal 
components of the picture; by so doing one not only compensates for the perspectival
asynchrony, but in addition and more importantly, facilitates the perception
of the internal components. The problem with this interpretation, as we see
it, is that it is not obvious why viewing a picture from an incorrect station
point should trigger a compensatory attitude any more than the actual knowledge 
that one is in the context of picture-viewing. We venture that a more useful 
approach to the Pirenne paradox lies in noting in what ways a perspective from 
the wrong station point could be more informative about the internal components 
than a perspective from the correct station point. Is it that the perspective 
accompanying an off-center station point specifies the perspective at the on- 
center station point; in short, that at the wrong station point one has, in 
some curious fashion, two perspectives on the static object? 

All this concern with perception from particular points of view and with 
the perception of pictures as a possibly particular kind of seeing leads us,
without too much difficulty, to perceiving — more precisely to visualizing — from 
no particular point of view. Exemplary of such visualizing is imaging; it has 
been Paivio's contribution to restore imaging to respectability in academic 
psychology. 

The mechanisms of imaging are part and parcel of a "nonverbal" system 
which is said by Paivio to mediate both our experience of the environment and 
our nonverbal actions. This imagery system operates independently of the 
"verbal" system which supports our linguistic endeavors whether they be per- 
formed by ear, eye, or hand. It is the case, as Paivio argues, that the verbal 
system is dependent on the nonverbal, for while the former communicates what 
we know about the environment, the latter is the primary source of that know- 
ledge. Nevertheless, the two systems are distinguished by the kinds of objects 
which comprise their respective memory components. For the imagery system the 
objects are said to be perceptual analogs, while for the verbal system they 
are discrete linguistic entities (for example, words). 

But how should we characterize the perceptual knowledge that Paivio refers 
to? On the assumption that the relevant entities are discrete and static images, 
we might use symbolic logic, formal grammars, machine theory, and the like to 
characterize them. On this assumption an image could be treated as a symbol, 
and perceptual knowledge viewed as a symbol manipulating system. Since language 
can be similarly characterized, the possibility arises that Paivio's imagery and 
verbal nodes are fed by one and the same symbol manipulating system. This ap- 
proach is favored by Anderson and Bower (1973) among others, but regarded with 
skepticism by Paivio. 

We have remarked several times in this summary of the volume that the 
informational support for perceiving and acting consists of abstract invariants 
defined over time and, further, that the kernel units for perceptual theory are 
kinetic events. If Paivio wishes to maintain that the perceptual knowledge 
which feeds his imagery system is continuous with perception, then we might 
wish to propose that perceptual knowledge is most appropriately characterized 
in terms of events — rather than static images — and cognately, in terms of dynam- 
ic abstract invariants. 

Our facility with metaphor provides a case in point. If I am requested to 
remember the sentence: "Rabbits are like children skipping rope down the side- 
walk" then an effective prompt at a later date is: "Kangaroos move like a bas- 
ketball being dribbled" (Verbrugge, 1975) . Why should this be so? It stretches 
the imagination to believe that the equivalence between the two sentences lies 
in semantic features common to rabbits, children, skipping ropes, kangaroos, 
and basketballs, or that it could be realized by compounding static images. We 
may conjecture that the two sentences share a common abstract invariant — period- 
ic up and down motion relative to the ground plane, and it is the detection of 
this invariant which determines their equivalence. 

We alluded above to imaging as perceiving from no particular station point. 
In a delightful mix of words Verbrugge (1975) remarks that: "Language is more 
like a piano score — an invitation to create meaning." In his perspective, the 
listener seeks structure among the virtual objects suggested by a sentence much 
as he seeks structure in the optic array — except in the linguistic case he does 
so from no particular station point. The suggestion is that the style in which 
we perceive language is not qualitatively different from the style in which we 
perceive or visualize the environment. Our guess is that if Paivio's nonverbal 
and verbal systems conflate at all, it is not because they use a common propo- 
sitional format, but because they are both oriented to the abstract invariants 
which specify events. 

Let us pursue the verbal mode a little further. With respect to language 
perception by ear, there are three aspects of that perception that we might dis- 
tinguish. We can identify a semantic mode in which we experience the meaning 
of what we hear, a phonological mode in which we experience what was said as 
distinct from what it means, and an acoustic mode in which we experience certain 
nonlinguistic aspects of speech (cf. Halwes and Wyre, 1974). Paivio's remarks 
and our comments in the preceding paragraphs were directed at the semantic mode; 
the paper by MacNeilage focused on the phonological and the acoustic. 

MacNeilage's bone of contention is that perceiving in the phonological mode 
is qualitatively different from perceiving in the acoustic mode. More precisely, 
MacNeilage takes issue with the claim that the underlying experiences of language 
at the phonological level are fundamentally articulatory processes. We may rec- 
ognize strong and weak versions of this claim. In the strong version, the pro- 
cesses responsible for phonological experience are identical to the neuromotor 
processes of articulatory coordination involved in speaking, but with the motor 
commands inhibited at some level prior to inducing mechanical muscular events. 
In the weak version, phonological experience is constructed from the acoustic 
data by virtue of knowledge about what human vocal tracts can and cannot do. 

The data often cited in support of the motor or articulatory theory of 
speech perception are no longer as compelling as they might once have been. 
Thus, one of the cornerstones of the theory, categorical perception, is now 
known to be indigenous to neither speech nor humans. Nevertheless, there are 
some curious observations which point to an intimacy between perceiving and 
producing speech that cannot be dismissed lightheartedly. Among these we might 
include the tight coupling between hearing and speaking vowels witnessed by 
the exceptionally rapid shadowing of Chistovich's (1961) subjects, and a recent 
and provocative discovery compatible with the weaker version of the theory 
that has been made by Liberman and Dorman (see Liberman, 1975). If two 
syllables such as /beb/ and /de/ are arranged very closely together in time, one 
of the stop consonants is "masked" so that the listener hears /be/ instead of 
/beb/. However, this perceptual impairment can be readily eliminated by having 
the two syllables spoken by two different vocal tracts: no matter how tem- 
porally proximate is the presentation of the two syllables, as long as they are 
produced by different vocal tracts they can be heard as separate phonological 
events. In the perspective of the weaker version of the articulatory theory, 
this result is interpretable in terms of the listener's tacit knowledge of vocal 
tracts which specifies that although the rapid transition from one stop conso- 
nant to the other is impossible for a single speaker, it can be achieved easily 
by two speakers. 

However, the thrust of MacNeilage's survey is not to be denied; there is 
relatively little to recommend a motor theory. The hypothesis that speech is 
perceived by reference to how it is produced is countered by the hypothesis 
that speech is produced by reference to how it is perceived, that is, the motor 
theory of perception is nullified by an acoustic theory of production. In view 
of the latter, we might not wish to regard either phonological perception or 
production as parasitic on the other, but rather to hold that perceiving speech 
and producing speech are related through an abstract structure that is common to 
both but indigenous to neither (Turvey, 1976). At least for the lowly cricket there 
is a suggestion that perception and production are manifestations of the same 
structure: a common gene might mediate the male's song and the female's 
perception of it (Hoy and Paul, 1973). 

Perhaps the larger point to be made with respect to a comparison of perception 
in the phonological and acoustic modes is that nonphonological auditory 
perception has not been treated fairly in theory and research. In studying the 
auditory perceptual system, insufficient weight has been given to its primary 
role of detecting environmental sources of mechanical disturbance. Ecologically, 
the role of audition is to identify the source of sound and the behavior of the 
identified source (cf. Schubert, 1975). The auditory perceptual system, like 
its visual counterpart, is oriented to events, but our understanding of auditory 
perception outside of speech, is based on the perception of sounds that are more 
nearly abstract than event related. 

Consider the common use of artificial sounds in the laboratory; examples 
are steady-state pure tones or steady-state short bursts of random noise. The 
most notable feature of the perception of sounds such as these is that they re- 
sist reliable identification (Pfafflin and Matthews, 1966; Webster, Woodhead, 
and Carpenter, 1970). In part, this seems to be owing to the fact that sounds 
relating to ecological events — the class of sounds to which the auditory percep- 
tual system has been attuned by evolution — involve rapid transients in intensity. 
These transients are concomitants of the onsets and offsets of the mechanical 
disturbances to which the sounds correspond. In the absence of these transients, 
specification of the identity of the source of the sound is far from ideal (see 
Saldanha and Corso, 1964; Luce and Clark, 1965, 1967). 

Now speech perception is the perception of sound as modulated by articula- 
tory events. But the nonspeech perception with which it is often compared is 
the perception of sounds that have been stripped of ecological validity. A pure 
steady-state tone specifies no event whatsoever. The contrast between speech 
and nonspeech perception or linguistic and nonlinguistic perception is, in our 
opinion, more often a contrast between event perception and nonevent perception. 
Such being the case, speculation on how the perception of speech differs from 
that of nonspeech is premature. Imagine hearing a can or a dish fall to the 
ground. We can ask with Schubert (1975:102) "Was the can large or small; of 
heavy or light construction; was it in contact with a hard surface like concrete 
or an absorbent one like earth or grass? Did the dish shatter or bounce?" 
Conjecturally, we answer these questions based on the fact that the objects and 
substances involved, and their interactions, modulate the acoustic array in 
specific and invariant ways. But what do we know of such invariants and their 
detection? The answer, unfortunately, is very little. Nevertheless, it is the 
character of this kind of auditory perception to which the character of speech 
perception should be compared. There is one modest difference between the two 
kinds of perception which immediately comes to mind. Differentiating nonspeech 
environmental events probably takes full advantage of the exteroceptive exper- 
tise of vision; in contrast it is roughly apparent that vision's role in the 
differentiating of speech events is minimal. 

At this juncture let us anthologize our review and comments thus far. To 
the primacy of vision we have now added the primacy of abstract relational in- 
formation defined over time. The latter is meant to contrast with the more 
common attitude which asserts the primacy of nominal, punctate, and momentary 
entities in perception. Furthermore, we have promoted kinetic events as opposed 
to static retinal images or steady-state sounds as the ecological entities to 
which evolution has attuned perceptual systems and thus the proper departure 
point for theorizing. Admittedly this promotion does not reflect the bias of 
all of the authors of this volume but, ideally, our remarks have been sufficient 
to support our intuition that the event concept provides a unifying theme. 

We consider now the remaining two papers, those of Trevarthan and Halliday. 
If the papers discussed thus far can be categorized as papers directed to the 
what and the how of perception, that is, to the issues of what there is to be 
perceived and how it is perceived, then those of Trevarthan and Halliday may be 
categorized as papers directed to the who of perception — the epistemic agent or 
algorist (Shaw and McIntyre, 1974). As Shaw remarks, the questions of the "what," 
the "how," and the "who" of perception form a closed set of questions with answers 
to any one coimplicating answers to the other two. It is fitting, therefore, that 
the final papers in this volume emphasize the thus far omitted member of the above 
triad. 

Briefly, Trevarthan's major points are these: first, that psychologists 
are insufficiently sensitive to the implications of anatomy, particularly the 
somatotopic principle, for perception and action theory; second, that perceptual 
systems should be considered in the light of mechanisms for action; and finally, 
that contrary to time-honored claims, infant behavior is intentional. This last 
point is also the larger point of Halliday's essay. 

The organization of the vertebrate midbrain provides an instructive example 
of both the first and second points. If a map is drawn of the projection from 
the eyes to the midbrain tectum in the coordinates of the eye, then for animals 
with frontal-oriented eyes and animals with lateral-oriented eyes, the two maps 
are quite dissimilar. However, if the maps are drawn in the coordinates of the 
behavioral field, that is with respect to the asymmetry of the body, we would 
observe that the two maps are virtually identical. As a general principle, the 
mapping from eyes to tectum in the coordinates of the behavioral field is rela- 
tively invariant; and this mapping of visual loci also maps a topography of 
points of entry into the action system. 

The confluence between seeing and doing was highlighted earlier in Lee's 
paper and that between hearing and speaking was critically examined in MacNeilage's. 
A further, though brief comment on the perception-action relation is warranted. 
The problem of coordination is the problem of controlling the enormous number of 
degrees of freedom that the biokinematic links — the skeletomuscular hardware — 
can attain (Bernstein, 1967). In view of the indeterminacy of the peripheral 
motor apparatus, it is most unlikely that executive processes coordinate movement 
through the individual control of each degree of freedom. In short, action plans 
are probably not written in terms of individual muscle contractions. The 
alternative view (Gelfand, Gurfinkel, Tsetlin, and Shik, 1971) is that action plans 
are written in terms of muscle linkages, that is, muscle-joint complexes whose 
activities covary and whose kinematic characteristics are similar. Such linkages 
may be referred to as coordinative structures (Turvey, 1976). The role of these 
structures is to reduce the number of degrees of freedom requiring control, for a 
coordinative structure behaves quasi-autonomously and therefore, from the perspective 
of an executive procedure it represents but a single degree of freedom. Coordi- 
native structures provide only a partial solution to the problem of degrees of 
freedom. In the performance of acts the degrees of freedom are regulated with 
precision, but an action plan is by necessity crudely specified in the language 
of coordinative structures. We therefore ask: How are movements performed that 
are precise in their timing, velocity, and displacement? Obviously perception 
must modulate unfolding action plans, but in order to do so perceptual infor- 
mation must be parsed in ways compatible with the nested components of the evolv- 
ing act and must be injected into the action system at the right place and at 
the right time. How this is done is not at all apparent, but we may regard 
Trevarthan's comments on somatotopic organization and on the contrastive capa- 
bilities of focal and ambient vision as preliminary steps in the direction of 
an answer. 

Let us conclude our summary of this volume with the shared insights of 
Trevarthan and Halliday on the nature of infant behavior. An appropriate back- 
drop is provided by a brief consideration of Gibson's shift away from percep- 
tual psychophysics. In common with his predecessors, Gibson in his early writings 
(Gibson, 1950) adopted the causal chain theory of perception; perceptual experi- 
ence was caused by stimuli. However, as he developed the concept of the optic 
array it became evident to him that the formulation "stimuli trigger perception" 
was incorrect and that a more judicious formulation was that "the ambient optic 
array supports the regulation and coordination of activity." The significance 
of the reformulation is that it emphasizes exploration and selection with the 
animal as agent rather than the animal as reactant. 

Suppose that we do adopt an agent or algoristic oriented view of the rela- 
tion between what there is to be perceived and how it is perceived. Do we mean 
to hold to this view for all stages of ontogeny? Popular scientific and 
not-so-scientific opinion would most likely respond "no." For the agentlike qualities 
of the adult perceiver/actor are said to result from a lengthy apprenticeship; 
the infant human reacts to stimuli in the ageless story, and only comes to plan 
and regulate his behavior with respect to information after the slow process of 
enculturation. The contrary and, perhaps, radical claim of Trevarthan and 
Halliday is that the infant is inherently purposive. What we witness in 
Trevarthan's and Halliday's behavioral and protolinguistic analyses of infant 
life is the infant as algorist possessing and deploying a stock of fundamental 
strategies or modes for selectively operating upon the world. The disposition 
of these strategies rests on the capacity to distinguish between animate and 
inanimate objects which afford different possibilities of interaction. The 
infant communicates vocally and gesturally with animate objects, but reaches for 
and manipulates inanimate objects. We learn from Halliday that the inchoate 
vocalizations of early childhood are actually basic acts of meaning, intended 
in part, to procure material ends and to maintain contact with and regulate the 
behavior of those who enter into the communication scenario. To the claim that 
the infant is inherently agentlike we add the claim that the infant is inherently 
social. 

NOTES 

"Mode" has many synonyms of which "style" and "fashion" are perhaps the 
most common. We speak about this and that style of dress and we will often pass 
comment on how fashionable or unfashionable a given style happens to be. Such 
comment is intended to relate the style in question to the context of contem- 
porary living. It is a matter of whether the style is compatible with some 
broader context of constraints, although the criteria for adjudicating on this 
subject are rarely unequivocal. 

Fashionableness is a passing quality although there are no fixed time limits 
on a style's period of grace. Nevertheless, it is fair to claim that the lon- 
gevity of a style of dress is considerably shorter than that of other styles, such 
as the style of eating. Other styles are even more perpetual; the style of human 
locomotion, for example, has undergone relatively little change. 

Styles, therefore, may be said to lie on a continuum from persistent to 
transient, and we additionally propose, from immutable to docile. Consider a 
further aspect; given several styles of dress, a person cannot be dressed in 
more than one style at a time. In short, different styles of dress are mutually 
exclusive. Styles are also said to be stereotypic, invariant ways of doing 
things. A not uncommon reproach of haute couture by those excluded is that 
they, the in-crowd, all dress or act alike. The epithet stereotypic must be 
handled cautiously, for its use is likely to blind one to the important fact 
that to be in style does not mean that one is a carbon copy of one's comrades 
in fashion. Rather, one's dress differs perceptibly from that of the others 
in ways which do not violate the prescribed, although often ineffable, conven- 
tions. We may say, therefore, that to be in a style is to be in a certain ball- 
park of states. We will proceed to define a style as a set of constraints which 
ensures the realization of an invariant condition over variable instances. 
Unfortunately, equating style and constraint is not a simple way in which to 
classify styles. Constraints (and thus, by definition, styles) vary on a scale from 
light to severe, with the severity of a constraint measured by the reduction it 
causes in the number of possible configurations, that is, the extent to which 
it freezes degrees of freedom. 
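
The measure of severity just described can be made concrete with a small sketch. We assume, purely for illustration, that each degree of freedom takes a finite number of values, so that severity is the fraction of configurations a constraint eliminates. The function name `severity` and the example degrees of freedom are hypothetical and are not drawn from the papers under review.

```python
from math import prod

def severity(values_per_dof, frozen):
    """Fraction of configurations eliminated when each degree of freedom
    named in `frozen` is fixed to a single value (an illustrative measure)."""
    total = prod(values_per_dof.values())
    remaining = prod(n for dof, n in values_per_dof.items() if dof not in frozen)
    return 1 - remaining / total

# Hypothetical degrees of freedom for a style of dress, with the number
# of values each may take when unconstrained.
dress = {"hem": 4, "collar": 3, "color": 5}

light = severity(dress, {"collar"})                  # freezes one degree of freedom
severe = severity(dress, {"hem", "collar", "color"})  # freezes all of them
```

On this reading a light constraint leaves many configurations open, while a severe one leaves few; a constraint that freezes nothing has a severity of zero.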

Consider the relation between style of dress and style of dance. We have 
remarked already that I cannot be in two styles of dress simultaneously as one 
excludes the other. Similarly I cannot be dancing in two styles simultaneously. 
Nevertheless, I can be in a style of dress and dance in a certain style at the 
same time if one of two conditions exists. First, when my style of dance and 
my style of dress do not affect one another, as is the case when my style of 
dress does not restrict my movements, then I am perfectly able to do a certain 
dance while in a certain style of dress. Second, when my style of dress does 
restrict my movement in some particular way I can still perform a certain dance 
if the dance constrains my movements in the same way as my style of dress. For 
example, one can do the currently popular hustle while wearing platform shoes, 
since the constraint on bending one's foot is the same for the hustle as for 
platform shoes. However, an Irish jig and platform shoes are not compatible, 
since the constraint on bending one's feet imposed by platform shoes is not 
compatible with dancing the jig. 

Speaking more generally, two or more styles are compatible (that is, they 
can coexist) if (1) they govern different degrees of freedom or (2) they 
selectively freeze the degrees of freedom which they have in common in the same way. 
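
The two compatibility conditions can be sketched as follows. This is a minimal illustration that assumes a style may be represented as a mapping from the degrees of freedom it freezes to the values at which it freezes them; the names `compatible`, `foot_bend`, and the like are hypothetical labels chosen for the dress-dance example.

```python
def compatible(style_a, style_b):
    """Two styles can coexist if they share no degrees of freedom
    (condition 1) or freeze every shared one in the same way (condition 2)."""
    shared = style_a.keys() & style_b.keys()
    # With no shared degrees of freedom, all() over an empty set is True.
    return all(style_a[dof] == style_b[dof] for dof in shared)

# Each style maps a degree of freedom to the value at which it is frozen.
platform_shoes = {"foot_bend": "rigid"}
hustle = {"foot_bend": "rigid", "tempo": "fast"}
irish_jig = {"foot_bend": "flexed", "tempo": "fast"}
loose_dress = {"sleeve": "short"}

shoes_and_hustle = compatible(platform_shoes, hustle)    # shared, frozen alike
shoes_and_jig = compatible(platform_shoes, irish_jig)    # shared, frozen differently
dress_and_jig = compatible(loose_dress, irish_jig)       # nothing shared
```

When the check fails, as with the jig and platform shoes, the sketch says nothing about which style gives way; that is the coalitional question taken up next.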

Returning to our dress-dance metaphor, we intuit that when neither of the 
above conditions is satisfied, styles behave in a coalitional (free-dominance) 
fashion. That is to say, styles are not organized in a strictly hierarchical 
manner. Any one style may take precedence over any other style, depending on 
the event in which the two styles take part. Thus, I may be intent upon wearing 
my platform shoes in which case I modify the jig so that I do not bend my feet; 
or, I may be intent upon doing the jig correctly, so I take my platform shoes 
off and dance in my bare feet. 

Substituting the term mode for that of style, we may summarize as follows: 
a mode is a set of constraints which guarantees the realization of an invariant 
condition over variable instances; such sets of constraints may range from tem- 
porary to permanent and from flexible to unchangeable; two or more such sets 
of constraints may operate simultaneously if certain conditions prevail; gen- 
erally, the organization of modes is coalitional. 

In terms of the preceding, we may approach the question of how mode in 
psychology is to be understood by asking: What constraints are operating when 
an occasion of perception — a perceptual condition — is labeled as an instance 
of this or that mode? Ideally, we seek to identify those constraints that are 
both necessary and sufficient for applying a mode label. As a rough strategy, 
we can ask initially what constraints are necessary and then inquire whether 
they are also sufficient. 

Reference has been made in this volume to a speech mode and a nonspeech 
mode, and, in the case of vision, to a focal mode and an ambient mode. It is 
roughly apparent that the constraints governing the information available to 
a perceiver (that is, what there is available for the animal to perceive) are 
necessary for defining a given mode. However, it is also roughly apparent that 
those constraints are not sufficient (in and of themselves) for the application 
of a unique mode label in each particular situation where those constraints 
occur. Indeed, we argue that while the set of constraints corresponding to the 
set of answers to the question "what is the animal perceiving?" is necessary 
for the application of the label for a given mode, it is not sufficient. As 
a case in point, a musical sequence is easily recognized as such even when the 
notes of a melody are presented as a speech signal. It has been shown that 
when the fundamental frequency of a melodic line is inflected on a high quality 
synthesized syllable (tea), indices for a "nonspeech mode" are obtained even 
in the presence of overall conditions of stimulation normally associated with 
a "speech mode" (Darwin, 1969). 

Perhaps we should look at the set of constraints governing how information 
is processed as well as at the constraints on what is processed. In this way, 
we circumvent the problems caused by attempting to define mode strictly in terms 
of informational constraints. For example, it has been shown that indices for 
a nonspeech mode can be obtained with a natural speech stimulus if the 
perceptual task is a nonlinguistic one (Haggard and Parkinson, 1971). Apparently, 
a given input is processed in a different way when the nature of the perceiver's 
task changes. A less equivocal example is provided by the following experiment. 
When "0" is embedded in a list of digits it can be found more rapidly if the 
observer is told that he or she is looking for a letter, than if he or she is 
told that the target is a digit. Conversely, when "0" is a member of a list of 
letters, latency of search is considerably shorter if one is looking for a digit 
zero than if one is looking for the letter "oh" (Jonides and Gleitman, 1972). 



We see from the above examples that the set of constraints governing how 
information is processed is by necessity linked with the intent of the perceiver 
[the epistemic who, as defined by Shaw and McIntyre (1974)], as well as with 
what information exists in the surrounding medium. An illustration of the 
coimplicative relations among the what, the how, and the who of perception is 
provided by the hermit crab's "attitudes" toward a sea anemone. The description 
of these attitudes is due to von Uexküll (1957). To preface, let us 
identify the what of perception as the valences (see Gibson, 1966) specified 
in the ambient optic array structured by the sea anemone; the how of percep- 
tion as the exploratory and performatory measures taken by the crab in de- 
tecting and exploiting the different uses of the sea anemone; and the who of 
perception as the intents of the crab. In the first case, the hermit crab 
has been robbed of the actinians which it normally carries on its shell. These 
actinians serve to protect the crab from its enemy, the cuttlefish. In this 
case, the crab is described as assuming a "defense tone," and it plants the sea 
anemone on its shell. In the second case, the shell has been taken from the 
hermit crab, and the crab attempts often unsuccessfully to crawl into the sea 
anemone, the crab having assumed a "dwelling tone." Finally, the crab who has 
been left to starve for some time, assumes a "feeding tone" and proceeds to 
devour the sea anemone. Thus, if "defense," "dwelling," and "feeding," are 
mode labels, it would seem that answers to each of the what, how, and who 
questions are necessary for the application of one of the mode labels, and further, 
that answers to all three questions are sufficient for the application of a 
unique "mode label" in each particular situation. 

In these notes we have attempted, in a most elementary and approximate 
manner, to sketch the metatheory of modes. To this end we pursued the general 
concept of style, teasing from it several principles that we hoped might prove 
useful to the understanding of the more specific concept of mode in perceptual 
theory. Of these principles, the most fundamental equates mode with a set of 
constraints. We were motivated to ask whether, in defining a mode, the infor- 
mation for perception exhausted all the constraints, or whether the information 
for perception together with the algorithms for its analysis exhausted all the 
constraints. Our tentative answer to both of these questions is no. 
Unfortunately, that which appears to provide the full complement of constraints 
defining a mode is not something that we understand very well at all, namely, 
the relation among the what, the how, and the who of perception. It is our 
hunch that an appreciation of the aforementioned relation is the proper depar- 
ture point for a rigorous analysis of modes of perceiving. 

Discrimination of Intensity Differences Carried on Formant Transitions Varying 
in Extent and Duration* 

James E. Cutting and Michael F. Dorman 



ABSTRACT 

Dorman (1974) found that small intensity differences carried on 
the initial portions of consonant-vowel syllables were not discrimin- 
able. Similar differences carried on steady-state vowels and on iso- 
lated formant transitions, however, were readily discriminable. He 
interpreted the difference between the first and latter conditions as 
a phonetic effect. Using sine-wave analogs to Dorman's stimuli, 
Pastore, Ahroon, Wolz, Puleo, and Berger (1975) found similar results. 
They concluded that the effect is not phonetic, and that it is 
attributable to simple backward masking. The present studies observed the 
discriminability of intensity differences carried on formant 
transitions varying in extent and duration. Results support the conclusion 
of Pastore et al. (1975) to the extent that the effect is clearly not 
phonetic. However, these results and others suggest that simple 
peripheral backward masking is not a likely cause; instead, 
recognition masking may be involved. Moreover, the finding that phonetic- 
like processes occur elsewhere in audition does not necessarily im- 
pugn the existence of a speech processor; phonemic and phonological 
processes remain, as yet, unmatched. 

Perhaps the most impressive characteristic of speech perception is the 
efficiency of information reduction (Liberman, Cooper, Shankweiler, and Studdert- 
Kennedy, 1967). The speed and ease of phonemic segmentation is reflected in the 
rapid transformation of a 40,000 bit/sec acoustic signal into a 40 bit/sec 
phoneme string (Liberman, Mattingly, and Turvey, 1972), suitable for considerably 
further savings by conversion into higher-order, meaningful linguistic elements. 
One empirical manifestation of this process is categorical perception, a 
phenomenon in which phonetic properties of a syllable are rapidly extracted and 
separated from the acoustic waveform. In a discrimination task, acoustically 
different stop consonants that are labeled the same are typically perceived to 
be identical. Stops labeled as different, on the other hand, even though they 
may differ physically by the same amount, are readily perceived to be dissimilar 
(Liberman, Harris, Hoffman, and Griffith, 1957; Mattingly, Liberman, Syrdal, 
and Halwes, 1971; Pisoni, 1971, 1973). For example, acoustic information about 
trajectories of formant transitions — information that contributes directly to 
the phonemic percept — cannot be retrieved readily from sensory memory. 

Dorman (1974) found that phonemically irrelevant acoustic information also 
cannot be retrieved from sensory memory. He found that intensity differences 
carried on formant transitions of consonant-vowel (CV) syllables were largely 
undetectable. However, the same differences were eminently detectable when 
carried on steady-state vowels or on formant transitions isolated outside the 
syllable context. It appears that information-reduction mechanisms relevant for
speech do not distinguish between phonemically relevant and irrelevant informa- 
tion at this level. This is as it should be. Liberman et al. (1972:323), for 
example, suggest "that the distinction between speech and nonspeech is not made 
at some early stage on the basis of general acoustic characteristics," but rather 
after many speech-relevant processors have been polled for proper speechlike 
features. In other words, both phonemically relevant and irrelevant auditory
signals share some, probably many, early processing stages. This view is supported
by the results of a recent study (Pastore et al., 1975) which show that
intensity differences carried on frequency ramps before steady-state sine waves 
are as difficult to discriminate as intensity differences carried on formant 
transitions of CV syllables. 

Dorman's (1974) earlier account of the inability to discriminate intensity
differences on formant transitions is incorrect. He noted the similarity between
the poor discriminability of intensity differences on formant transitions
and poor discriminability of formant frequency within a phoneme category. Both
effects were attributed to the uniquely categorical, linguistic processing
accorded stop consonants: "After the acoustic cues for stop consonants have
been recoded into a phonetic [categorical] representation, all of the acoustic
information is stored in a relatively inaccessible short-term auditory memory"
(Dorman, 1974:86, italics added). The effect, however, is not necessarily the 
result of linguistic coding, since categorical perception occurs in several non- 
linguistic domains (Cutting and Rosner, 1974; Cutting, in press; Cutting, Rosner, 
and Foard, in press; Miller, Wier, Pastore, Kelly, and Dooling, in press; see 
also Locke and Kellar, 1973; Lane, 1965, 1967). Moreover, it does not appear 
contingent on categorical perception or phonemic processing at all, since the 
stimuli of Pastore et al. (1975) are likely neither to be perceived categorically
(see Pisoni, 1971:Experiment II) nor phonemically (see Cutting, 1974:
Experiment III).

Pastore et al. (1975) noted another problem with Dorman's account of his
results. They suggested that to change the carrier waveform from a CV syllable 
to a steady-state vowel syllable, as Dorman did, is to change the task at the 
same time from one of simple backward-masking detection to one of pedestal de- 
tection (see Tanner, 1958; Tanner and Sorkin, 1972). We concur that formal 
parallels are unmistakable between pedestal detection and the detection of intensity
differences carried at the beginning of the vowels. Thus, Dorman's
steady-state vowel control does not appear to eliminate simple backward masking
as a cause for poor discriminability of intensity differences carried on CV
syllables: pedestal detection experiments appear to be a special kind of masking
experiment.

Several important questions about masking arise. First, how might backward 
masking function in speech perception? For example, if phonemically irrelevant 
information can be masked at an auditory level, why is it that phonemically 
relevant information is not masked as well, rendering speech incomprehensible? 
Second, Pastore et al. (1975) do not suggest a particular relationship between 
backward-masking detection and pedestal detection tasks. For example, do the 
two tasks differ in degree or in kind? Should we expect intermediate detect- 
ability for speech syllables whose transitions are midway between those of a CV 
and a steady-state vowel? Or should we expect that all syllables with transi- 
tions, regardless of their extent or duration, would inhibit detection of inten- 
sity differences since only the steady-state vowel stimulus meets the requisite 
of having a true pedestal? Experiment I explores the detectability of intensity 
differences carried on the formant transitions of these intermediate stimuli. 
The discussion and Experiment II, which follows thereafter, explore the plaus- 
ibility of simple backward masking versus backward recognition masking as a 
cause of our results. 

EXPERIMENT I 

Method 

Two arrays of three-formant speech stimuli were generated on the Haskins
Laboratories parallel-resonance synthesizer. One array consisted of six items
differing in the extent of formant transitions, with all items identifiable as
/ba/ or /a/; the other array consisted of five items differing in duration of 
formant transitions, with all items identifiable as /ba/ or /bwa/. All stimuli 
were 300 msec in duration and had a flat pitch contour of 100 Hz. Steady-state 
/a/ resonances for both arrays centered on 769, 1232, and 2525 Hz for first, 
second, and third formants, respectively. The six-item /ba/-to-/a/ array con- 
tained stimuli whose formant transitions were 60 msec in duration. Transitions 
decreased in extent by equal increments over this array, in corresponding fash- 
ion for all three formants. Stimulus 1 (the prototype /ba/) transitions began 
at 513, 846, and 2180 Hz for the three formants, respectively; and Stimulus 6 
(the steady-state vowel /a/) began with formants of 769, 1232, and 2525 Hz. 
Intermediate stimuli had intermediate starting frequencies for each formant. 
The five-item /ba/-to-/bwa/ array contained stimuli whose formant transitions
always began at 513, 856, and 2180 Hz, but whose transition durations lasted 40, 
60, 80, 100, and 120 msec for Stimuli 1 through 5, respectively. The endpoint 
stimuli of both arrays are shown schematically in the top panels of Figure 1. 
Stimuli were digitized and stored on disc file using the pulse code modulation 
system at Haskins. Further stimulus alteration consisted of decreasing the 
initial portions of all stimuli by 0, 4, and 8 dB. For the /ba/-to-/a/ array 
the decreased portion was always 60 msec in duration (like that used by Dorman, 
1974), and for the /ba/-to-/bwa/ array it was held to the duration of the formant
transitions: 40, 60, 80, 100, or 120 msec. In this manner each of the
eleven stimuli was synthesized in three renditions. For an indication of the overall
amplitude envelope shape of these stimuli, see Dorman (1974:Figure 2).
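
The stimulus parameters described above can be sketched numerically. The following is only an illustration, not the synthesizer's actual procedure. It assumes that "equal increments" means linear spacing in Hz, it takes the /ba/ onset values (513, 846, 2180 Hz) from the description of the /ba/-to-/a/ array, and it assumes that attenuation in dB refers to an amplitude ratio (20 log10). All names are hypothetical.

```python
# Hypothetical sketch of the Experiment I stimulus parameters.
# Assumption: "equal increments" is read as linear spacing in Hz.
BA_ONSET_HZ = (513.0, 846.0, 2180.0)     # Stimulus 1 onsets (F1, F2, F3)
A_STEADY_HZ = (769.0, 1232.0, 2525.0)    # steady-state /a/ resonances


def extent_array(onset, steady, n_items=6):
    """Return transition onset frequencies for each stimulus in the array."""
    steps = n_items - 1
    return [
        tuple(o + (s - o) * i / steps for o, s in zip(onset, steady))
        for i in range(n_items)
    ]


def amplitude_gain(attenuation_db):
    """Linear amplitude factor for a given attenuation in dB (assumed 20*log10)."""
    return 10.0 ** (-attenuation_db / 20.0)


if __name__ == "__main__":
    stimuli = extent_array(BA_ONSET_HZ, A_STEADY_HZ)
    for number, onsets in enumerate(stimuli, start=1):
        print(number, [round(f, 1) for f in onsets])
    for db in (0, 4, 8):
        print(db, "dB ->", round(amplitude_gain(db), 3))
```

Under these assumptions Stimulus 6 starts exactly at the steady-state values, so it has no transition at all, which matches its description as the steady-state vowel.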

Four diotic stimulus sequences were recorded on audio tape; one identification
sequence consisted of random orders of the standard (0 dB) stimuli, 48 and
40 items respectively, for the extent and duration stimuli. Each item in each
array appeared eight times. The interval between each item in both sequences 
was 3 seconds. Listeners wrote down BAH or AH, and BAH or BWAH to identify mem- 
bers of the arrays. Discrimination sequences consisted of 90 and 75 AX trials 
for the /ba/-to-/a/ and /ba/-to-/bwa/ arrays: (6 and 5 stimuli in the arrays, 
respectively) x (3 intensities to be discriminated: 0-, 4-, and 8-dB differ- 
ences between members of the AX pair) x (5 observations per pair). Each dis- 
crimination trial began with a 100 msec 1000 Hz warning tone, followed by 500 
msec of silence, followed by Stimulus A, another 500 msec silent interval, and 
Stimulus X. Stimulus A was always the standard stimulus, whereas Stimulus X 
had formant structures identical to Stimulus A but with its initial portions 
attenuated by 0, 4, or 8 dB. There was a 3.5 second interval between the offset 
of Stimulus X and the onset of the warning tone for the subsequent trial. Listeners
wrote down S for same if they thought the AX items were identical, and D for
different if they were not.
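
The trial counts and timing stated above can be checked with a few lines of arithmetic. This is a sketch under one assumption: each silent interval is taken to run from the offset of one event to the onset of the next, which the text does not state explicitly. The function names are hypothetical.

```python
# Hypothetical arithmetic check of the discrimination sequences.
STIMULUS_MS = 300  # all stimuli were 300 msec long


def discrimination_trials(n_stimuli, n_intensities=3, n_observations=5):
    """Number of AX trials: stimuli x intensity differences x observations."""
    return n_stimuli * n_intensities * n_observations


def trial_period_ms(tone=100, gap=500, stimulus=STIMULUS_MS, intertrial=3500):
    """Warning tone, silence, Stimulus A, silence, Stimulus X, intertrial interval."""
    return tone + gap + stimulus + gap + stimulus + intertrial


if __name__ == "__main__":
    for label, n in (("/ba/-to-/a/", 6), ("/ba/-to-/bwa/", 5)):
        trials = discrimination_trials(n)
        minutes = trials * trial_period_ms() / 60000
        print(label, trials, "trials,", round(minutes, 1), "min")
```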

Thirteen Wesleyan University students listened as a group to the four se- 
quences as part of a course project. All were native American English speakers 
with little experience at listening to synthetic speech. They listened to the 
audio tapes played on a Crown CX-822 tape recorder, broadcast in a quiet room 
over an Ampex AA-620 loudspeaker. All listeners sat between 8 and 18 feet from
the loudspeaker, which for the standard item delivered approximately 75 dB SPL. 

Results 

All results are shown in the lower panels of Figure 1. In the left-hand 
panel, identification functions for /ba/ and /a/ are superimposed on two dis- 
crimination functions, those for judgments of 4- and 8-dB differences. Stimuli 1 
through 5 were consistently identified as /ba/, and only Stimulus 6 was identi- 
fied consistently as /a/. The identification "boundary" appears to be located 
near Stimulus 5, where the two complementary identification functions cross.
Discrimination functions (percent correct discrimination of intensity differences
at each comparison) show that 8-dB judgments were consistently more successful
than the 4-dB judgments [F(1,144) = 65.1, p < .001]. There was no interaction
of intensity with stimulus location along the array; therefore, collapsing
across the two intensity differences, there was a significant increase in discriminability
as the formant transitions decreased in extent [F(5,144) = 3.15,
p < .025]. Moreover, a trend test (Winer, 1962:132) proved this increase to be
linear [F(1,64) = 49.3, p < .001] with no significant quadratic, cubic, or other
higher-order components. The D responses on AA trials (those with 0-dB difference)
were scored as false alarms, and the detectability of the intensity differences
was then assessed independent of possible response bias. A generally
linear increase was obtained: the d' scores for 4-dB judgments were .44, .60,
.84, 1.10, 1.08, and 1.15; and those for 8-dB judgments were 1.61, 1.56, 1.89,
2.20, 1.92, and 2.09, respectively, for the six different transition extents.
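
For readers unfamiliar with the bias-free measure, the following sketch shows one common way to obtain d' from a hit rate ("different" responses on trials with a 4- or 8-dB difference) and a false-alarm rate ("different" responses on 0-dB trials). It assumes the equal-variance Gaussian yes-no model; the text does not say which model was used for these AX data, and the clipping of extreme rates is an added assumption. The function names are hypothetical.

```python
from statistics import NormalDist


def clip_rate(rate, n_trials):
    """Keep a proportion away from 0 and 1 so its z-score stays finite (assumed correction)."""
    floor = 1.0 / (2 * n_trials)
    return min(max(rate, floor), 1.0 - floor)


def d_prime(hit_rate, false_alarm_rate, n_trials):
    """d' = z(hits) - z(false alarms) under the equal-variance Gaussian model."""
    z = NormalDist().inv_cdf
    return z(clip_rate(hit_rate, n_trials)) - z(clip_rate(false_alarm_rate, n_trials))


if __name__ == "__main__":
    # Illustrative rates only; these are not data from the experiment.
    print(round(d_prime(0.70, 0.30, n_trials=65), 2))
```

A listener who says "different" equally often on same and different trials gets d' = 0 whatever the overall tendency to say "different", which is the sense in which the measure is independent of response bias.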

Results for the /ba/-to-/bwa/ array are shown in the lower right-hand panel 
of Figure 1. Identification functions are somewhat unimpressive: only Stimulus
1 was consistently identified as /ba/ and, whereas Stimuli 3 through 5 were
primarily identified as /bwa/, none was so identified with a consistency exceeding
72 percent. The identification "boundary," if one can be said to exist,
appears to be near Stimulus 2. The pattern of discrimination results followed
very closely that for the previous set of stimuli. Again, 8-dB judgments were
superior to 4-dB judgments [F(1,120) = 58.9, p < .001]; discriminability
increased across the stimulus array [F(4,120) = 7.9, p < .001], and that increase
was linear [F(1,51) = 73.0, p < .001] without significant higher-order components.
This linear pattern was repeated in terms of detectability: 4-dB d'
scores were .24, .45, .84, 1.26, and 1.28; and 8-dB scores were 1.60, 1.70,
1.96, 2.12, and 2.22, respectively, for the five different transition durations.

Discussion 

Two aspects of our results support the primary conclusion of Pastore et al. 
(1975): the inability to detect intensity differences carried on the formant 
transitions of stop consonants is a psychoacoustic rather than phonetic effect. 
First, there is no abrupt increase in detectability of intensity differences as 
the stimulus arrays change from /ba/ to /a/ for those stimuli differing in extent 
of transitions, and from /ba/ to /bwa/ for those differing in duration of 
transitions. If the availability of acoustic information were somehow inhibited 
by the processing of the highly encoded stop consonant in particular, one would 
have expected a quantal increase in discriminability in the /ba/-to-/a/ array at 
about Stimulus 5. Clearly none exists, and thus the effect cannot be directly 
related to categorical perception. Studdert-Kennedy, Liberman, Harris, and 
Cooper (1970), among others, would predict discontinuities in the discrimination 
functions at this point if the phenomenon were related to categorical percep- 
tion. Second, the increase in discriminability is linear for both arrays. Such 
linear increases are also at variance with the nonlinear, categoricallike pro- 
cesses associated with phonetic perception. 

Our results demonstrate interaction between rate of frequency change and 
the discrimination of intensity change on formant transitions. That is, for the 
/ba/-to-/a/ array in particular, the less frequency change that occurs, the more 
perceptible the intensity differences become. Thus, frequency and intensity 
appear to be yoked in the percept and contribute in an interactive manner to the 
traces available to short-term auditory memory. Of course, as Pastore et al. 
(1975) admit, finding a psychoacoustic basis for the inability to detect such 
intensity differences here does not rule out the possibility that a similar
outcome could result from processes occurring at other levels. In visual mask- 
ing, for example, Turvey (1973) demonstrated that when viewers were unable to 
report a target, the contour information may have been masked peripherally or 
centrally. At both levels, the effect is similar: viewers are unable to identify
the target.

A Second Look at Simple Backward Masking 

The secondary conclusion of Pastore et al. (1975), that these results are 
caused by simple backward masking, is more suspect. While they do not mention 
these issues, the type of phenomenon they refer to appears to be threshold 
masking rather than recognition masking (Massaro, 1973, 1975). The locus of the 
backward masking appears to be peripheral not central, and it appears to result 
from target-mask integration, not interruption (see Kahneman, 1968; and Turvey,
1973, for arguments with respect to vision). From this view of masking one 
might not expect to find evidence in any experimental paradigm of the ability to 
detect 4 to 9 dB intensity differences carried on formant transitions. That is, 
this information would be buried in background noise considerably prior to the 
decision making process. There are several reasons to suspect, however, that 
the intensity information in the Dorman (1974) and present studies is not lost 
by simple, peripheral target-mask integration. 

[...] syllables, as opposed to those on steady-state vowels. Direct comparisons are
difficult: (a) since Dorman used attenuations of 7.5 and 9.0 dB, whereas we
used attenuations of 4 and 8 dB, (b) since Dorman used the carrier stimuli /bæ/
and /æ/ whereas we used /ba/ and /a/, and (c) since Dorman's listeners heard his
stimuli through earphones, whereas we played them over a loudspeaker in a reverberant
room. Nevertheless, a striking trend can be seen when d' scores for his
stimuli are compared with those for Stimuli 1 and 6 from the /ba/-to-/a/ array.

In the present study, by mixing the CV and V stimuli together with several
intermediate items, the detectability of the intensity differences carried on
the CV syllables increased considerably. It decreased, on the other hand, for
those differences carried on steady-state vowels. It would appear, then, that
a large proportion of the effect is attributable to context, not to masking.
That is, detectability varies according to previous experience and expectations
within the experiment. The difference in detectability for intensity differences
in CV and V syllables changed from a standard score of more than 3.2 (for
Dorman's 9-dB discriminations) to one of less than .5 (for our 8-dB discriminations).
Such a finding appears to be at variance with the hypothesized effect
of simple peripheral masking, and suggests that: (a) the intensity information
is available at some level of perceptual analysis and that (b) recognition masking
rather than threshold masking may be involved in the Dorman (1974) and
Pastore et al. (1975) results.

A second avenue of reasoning comes from the many studies of categorical
perception of stop consonants, and the fate of within-phoneme-category formant
frequency information. In ABX (Liberman et al., 1957), odd-ball (Mattingly
et al., 1971), and AX (Pisoni, 1971, 1973) paradigms, the discrimination of frequency
differences carried on formant transitions has been found to be categorical;
that is, the frequency difference in formant transitions within the same
phonemic category is discriminated at about chance, while the frequency difference
across categories is discriminated very easily. Despite essentially chance
within-category performance, frequency information is neither masked in auditory
processing nor lost in the auditory-to-phonetic transformation (see Barclay,
1972; Pisoni and Lazarus, 1974). Pisoni and Tash (1974), for example, have
shown that "same" reaction times (RTs) to physically different but phonemically
identical stop consonants are slower than "same" RTs to physically identical
stop consonants. Thus, even though the discrimination response implies that the
two signals were perceived identically, and by inference that there was no
distinguishing information left about formant trajectories, the RTs indicate 
that at some level in the nervous system the information was present. We would 
expect a similar outcome in an RT analysis with the signals used in the present 
study. That is, we suspect that the "same" RTs to the physically different 
(4 dB) signals would be slower than the "same" RTs in the physically identical 
(0 dB) condition. Experiment II was conducted to test this hypothesis. 



EXPERIMENT II 



Method 



Two stimuli were selected from Experiment I: Stimulus 1 (/ba/) and Stimulus 
6 (/a/) from the array with transitions differing in extent. Both were generated 
in three renditions: the initial 60 msec was attenuated by 0, 4, and 8 dB. One 
discrimination sequence was assembled exactly as in Experiment I. It contained 
120 AX trials: (2 stimuli) x (3 intensities to be discriminated) x (20 observa- 
tions per item). Listeners pressed, as rapidly as possible, one of two tele- 
graph keys to indicate whether the two items within a trial were the same or 
different. Reaction times were fed on line into a PDP-11 computer for analysis. 
They were measured from the onset of the second item to the onset of the keypress.

Four students and staff members at Haskins volunteered for the experiment. 
All were naive to the purposes of the task. They listened, in groups of two, 
to stimuli reproduced on an Ampex AG-500 tape recorder and transmitted binaurally
through a listening station to Telephonics headphones (TDH-39).

Results and Discussion 

The most important reaction time results are shown in Table 2: mean RTs
for "same" responses for the 0-dB and 4-dB discriminations. Few "same" responses
were made for 8-dB trials, so they are not included. The difference in
RTs for the two conditions ranged from 85 to 336 msec for the four listeners;
the results for three listeners were statistically robust by a Mann-Whitney U
test on individual reaction times, while those for the other listener approached
significance. (U scores were converted into standard units, as shown in
Table 2.) These results clearly indicate that intensity information not
discriminated on a particular trial is not masked in absolute terms, but is
represented in some form throughout the information processing system. The representations
for the two stimuli within a trial that differ in the amplitude of
their onsets by 4 dB are more difficult to match than are those for pairs with the
same onset amplitude. These results are congruent with those of Pisoni and
Tash (1974) using speech syllables, and with prior results of Emmerich, Gray,
Watson, and Tanis (1972) using nonspeech stimuli.

TABLE 2: Mean reaction time in msec (and number) of "same" responses to intensity
differences carried on the initial 60 msec of CV syllables. Maximum
number of trials per cell is 20.

                 Intensity difference
Listener         0 dB         4 dB          z       p (one-tailed)
T.B.             659 (19)      942 (9)      3.02    < .002
P.B.             609 (16)      694 (11)     1.19    < .12
W.F.             612 (17)      818 (8)      1.92    < .03
H.S.             669 (18)     1005 (4)      2.89    < .002
Mean of means    637           865
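
The conversion of a Mann-Whitney U score into standard units can be sketched as follows. This is a minimal illustration using the usual large-sample normal approximation without a tie correction to the variance; whether the original analysis applied exactly this form is an assumption. The function names and the reaction times in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_z(x, y):
    """Return (U, z), where U counts pairs in which a y value exceeds an x value.

    Ties receive midranks. z uses the normal approximation with
    mean n1*n2/2 and variance n1*n2*(n1+n2+1)/12 (no tie correction).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1          # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_y = sum(r for r, (_, group) in zip(ranks, pooled) if group == 1)
    u = rank_sum_y - n2 * (n2 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, z


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    same_0db = [610, 640, 655, 700]        # hypothetical RTs in msec
    same_4db = [820, 905, 990]             # hypothetical RTs in msec
    u, z = mann_whitney_z(same_0db, same_4db)
    print(u, round(z, 2), round(one_tailed_p(z), 3))
```

Applying `one_tailed_p` to the z values reported for the four listeners gives probabilities that fall under the one-tailed limits listed with them.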

CONCLUDING DISCUSSION 

The results of the present studies suggest, first, that the relationship 
between pedestal-detection and recognition-masking experiments is one of degree 
rather than kind. There is no discontinuity between the two. Second, the results
support the primary conclusion of Pastore et al. (1975): the relative inability
to discriminate intensity differences carried on the formant transitions
of CV syllables, as compared to those carried on the initial portions of steady-state
vowel syllables, is an effect that is psychoacoustic rather than phonetic.

Third, our results demonstrate differences between types of masking. 
Pastore et al. (1975) appear to attribute the inability to discriminate differences
carried on formant transitions to simple backward masking. Simple masking,
according to Licklider (1951), is the opposite of analysis. Information is
simply not processed, and the implication is that masked information is irretrievably
buried in background noise. However, results of Experiment II show
that phonemically irrelevant acoustic information remains accessible to the 
listener in some form. This suggests that recognition masking is the phenomenon 
involved in the Dorman (1974) and Pastore et al. (1975) experiments and Experi- 
ment I of the present investigation. Moreover, recognition masking is selective 
in its effect on auditory versus phonetic memory codes. Fourth, a comparison of 
the detectability scores from Dorman's (1974) study and those from Experiment I
also suggests that this information is not masked absolutely even in recognition
terms, but may be used or unused as a function of context in an experimental 
session. 

On the "Speech Processor" 

Pastore et al. (1975) suggest that a speech processor is an unneeded construct
to account for results in the AX discrimination task. We agree. Nevertheless,
whereas our results support this position, we must not ignore the necessity
for some such device at some level. The level at which any device is
specific to speech is currently a crucial question. Several effects thought to
demonstrate the psychological reality of phonetic processing (Wood, 1975:16)
have been found to occur in purely auditory domains (Cutting and Rosner, 1974;
Blechner, Day, and Cutting, 1976; Pastore, Ahroon, Puleo, Crimmins, Galowner,
and Berger, 1976; Cutting, in press; Cutting et al., in press; Miller et al.,
in press). Thus the mechanism that extracts phonetic information from the speech
signal may be the same device that is used elsewhere, for example, in the processing
of musiclike sounds. In other words, phoneticlike processing may not be
speech-specific processing. Yet these recent findings cut only into the lowest
tier of the speech-language hierarchy, that of phonetic processing. The perception
of different allophones of the same phoneme as being the same (such as the
/p/s in pit, spit, and tip) and the parsing of syllables from a continuous
speech stream seem to be processes without nonspeech analogs. Unless (or until)
analogs are found, the notion of a speech processor is not impugned by the existence
of phoneticlike processes elsewhere in audition.
Discrimination Functions Predicted from Categories in Speech and Music*
James E. Cutting and Burton S. Rosner



ABSTRACT 



Cutting and Rosner (1974) reported that sawtooth waves varying 
in rise time and identifiable as either plucked or bowed are per- 
ceived categorically according to the strictest criteria. The pre- 
dicted discrimination functions in that paper were incorrectly cal- 
culated. This note gives correct formulae and the predictions that 
they yield. The original finding is unchanged. 

Sawtooth waves differing only in rise time are identifiable as plucked or 
bowed notes from a stringed instrument. We previously reported (Cutting and 
Rosner, 1974) that these nonlinguistic sounds are perceived categorically. 
We also synthesized a continuum of speech sounds by varying only rise time. 
Listeners identified these sounds as /tʃa/ or /ʃa/ as in CHOP or SHOP, respectively,
and perceived them categorically as well.

Our criteria for categorical perception were those suggested by Studdert-
Kennedy, Liberman, Harris, and Cooper (1970): (a) "peaks" of high discriminability
between stimuli in restricted regions along the dimension studied;
(b) "troughs" of discrimination performance near chance in regions on either 
side of the peak; and (c) correspondence between discrimination peaks and 
troughs and the course of identification functions, with peaks occurring at 
identification boundaries and troughs occurring within each perceptual category.
Categorical perception is therefore revealed by a particular combination
of results from identification and discrimination tasks. This convergence
between identification and discrimination is unusual; a listener generally
can discriminate many more stimuli than he or she can identify absolutely
(see, for example, Miller, 1956).



* To appear in Perception and Psychophysics.
^ Also Wesleyan University, Middletown, Conn. 
University of Pennsylvania, Philadelphia, Pa. 
The correspondence between identification and discrimination can be tested
quantitatively. Discrimination performance can be predicted from identification
data by assuming that discrimination is no better than identification.
To the extent that obtained and predicted discrimination scores do not differ 
significantly, categorical perception has occurred. 

Our previous paper described such agreement between obtained and predicted 
discrimination scores for both the linguistic and musical sounds (Cutting and 
Rosner, 1974). Unfortunately, the predicted functions were derived through an 
incorrect formula. This note corrects that error. 

To predict discrimination from identification of a two-category continuum, 
the correct formula for an ABX discrimination task is 

$$P(c) = \tfrac{1}{2}\left[1 + (p_1 - p_2)^2\right] \qquad (1)$$

where P(c) is the probability of a correct discrimination, p_1 is the probability
of assigning stimulus A to one of the categories, and p_2 is the probability
of assigning stimulus B to that same category. The original formula for the 
three-category case published by investigators at the Haskins Laboratories 
(Liberman, Harris, Hoffman, and Griffith, 1957) was incorrect as printed; 
Pollack and Pisoni (1971) give proper formulae for both two- and three-category 
continua. We will refer to (1) as the Haskins prediction. 
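
As an illustration, the Haskins prediction for a two-category continuum, P(c) = 1/2[1 + (p_1 - p_2)^2], can be computed directly from two identification probabilities. The sketch below is not the authors' program; the function name and the sample probabilities are hypothetical.

```python
def haskins_prediction(p1, p2):
    """Predicted ABX proportion correct from identification probabilities.

    p1 and p2 are the probabilities of assigning stimuli A and B,
    respectively, to the same one of the two categories.
    """
    return 0.5 * (1.0 + (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical identification probabilities along a continuum.
    for p1, p2 in ((1.0, 1.0), (0.9, 0.2), (1.0, 0.0)):
        print(p1, p2, round(haskins_prediction(p1, p2), 3))
```

Two stimuli that are always labeled alike are predicted to be discriminated at chance (0.5), and two that are always labeled differently are predicted to be discriminated perfectly.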

Typically, obtained discrimination functions, even for stop consonants, 
systematically exceed predicted functions by as many as ten percentage points 
at each comparison along the stimulus array. Thus, the strongest possible re- 
lationship between identification and discrimination is not realized (see also 
Barclay, 1972; Pisoni and Lazarus, 1974; and Pisoni and Tash, 1974). The dis- 
crepancy between obtained and predicted discrimination functions is even larger 
for more "continuously" perceived stimuli such as vowels (Pisoni, 1971, 1973, 
1975). By further developing a model that Fujisaki and Kawashima (1970) formulated,
Pisoni added a correction factor to prediction formulae such as (1).
This factor is based on the asymptotic trough discrimination value; it raises 
the predicted functions by several percentage points, and it can be interpreted 
as measuring short-term auditory storage for differences between two stimuli 
identified alike. For a two-category continuum in an ABX task, the proper 
Fujisaki-Kawashima prediction formula is 

$$P(c) = \tfrac{1}{2}\left[(p_1 - p_2)^2 + p_1(1 - p_2) + p_2(1 - p_1)\right] + \left[p_1 p_2 + (1 - p_1)(1 - p_2)\right]T \qquad (2)$$

where P(c), p_1, and p_2 are the same as in (1) and T is the asymptotic trough
value of the obtained discrimination function. If T = 0.50, (2) reduces to (1).
Like the Haskins prediction formula, the Fujisaki-Kawashima formula has suffered
the misfortune of appearing incorrectly in print (Pisoni, 1971:44; Pisoni,
1975:13).¹
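
The Fujisaki-Kawashima prediction can be sketched in the same way. The code below assumes the form P(c) = 1/2[(p_1 - p_2)^2 + p_1(1 - p_2) + p_2(1 - p_1)] + [p_1 p_2 + (1 - p_1)(1 - p_2)]T, a reading that is consistent with the statement that the formula reduces to the Haskins prediction when T = 0.50. The function names and sample values are hypothetical.

```python
def haskins_prediction(p1, p2):
    """Two-category ABX prediction, formula (1)."""
    return 0.5 * (1.0 + (p1 - p2) ** 2)


def fujisaki_kawashima_prediction(p1, p2, trough):
    """Two-category ABX prediction with a trough correction (assumed form of formula (2)).

    The first bracket covers trials on which A and B are labeled differently;
    the second bracket is the probability that they are labeled alike, in which
    case performance is taken to equal the trough value T.
    """
    labeled_differently = (p1 - p2) ** 2 + p1 * (1.0 - p2) + p2 * (1.0 - p1)
    labeled_alike = p1 * p2 + (1.0 - p1) * (1.0 - p2)
    return 0.5 * labeled_differently + labeled_alike * trough


if __name__ == "__main__":
    # Hypothetical identification probabilities; T = 0.60 as an example value.
    for p1, p2 in ((1.0, 1.0), (0.9, 0.2), (1.0, 0.0)):
        print(p1, p2,
              round(haskins_prediction(p1, p2), 3),
              round(fujisaki_kawashima_prediction(p1, p2, 0.60), 3))
```

With this form, within-category pairs are predicted at the trough value T rather than at chance, while pairs that are always labeled differently are still predicted to be discriminated perfectly.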



¹ Page numbers for Pisoni (1971) refer to a version published as a supplement
to the Haskins Laboratories Status Report on Speech Research.
Using the correct formulae we have recomputed both the Haskins and the
Fujisaki-Kawashima predictions for our data on discrimination of sawtooth
waves and of affricate-fricative speech syllables. Predictions were made
for each individual listener, then averaged functions were obtained from the
individual functions, as Pisoni (1971) suggests.² Table 1 shows averaged obtained
and predicted discrimination scores.



TABLE 1: Obtained and correctly predicted discrimination values for stimuli
differing in rise time. The original predicted functions that
appear in Cutting and Rosner (1974) are incorrect.

                                    Rise time comparison (msec)
                                    0-20  10-30  20-40  30-50  40-60  50-70  60-80

Experiment 1
  Sawtooth wave stimuli
    Obtained
    Haskins predicted
    Fujisaki-Kawashima predicted
  Speech stimuli
    Obtained
    Haskins predicted
    Fujisaki-Kawashima predicted

Experiment 2
  Sawtooth wave stimuli
    Obtained
    Haskins predicted
  Sine wave stimuli
    Obtained
    Haskins predicted
The predicted functions in Table 1 are farther below the obtained functions
than were those originally published [see Tables 1 and 2 in Cutting and
Rosner (1974)]. Nevertheless, the discrepancies between predicted and obtained
scores here are not marked. Goodness-of-fit measures calculated from individual-obtained
and Haskins-predicted scores revealed no significant differences
(see Pisoni, 1971:20), although the observations per comparison may be too few
to make small differences statistically reliable. The fit between the data
and the correct predictions still supports our prior conclusion that musical
stimuli and affricate-fricative consonants differing in rise time are each
perceived categorically. Subsequent experiments have provided confirmation;
Cutting, Rosner, and Foard (1976) have demonstrated that the musical sounds
are perceived as categorically as stop consonants in Pisoni's (1971, 1973) variable-interval
AX discrimination task.

In summary, this note presents correct predicted discrimination functions
for data previously published (Cutting and Rosner, 1974). The corrections
leave the principal conclusion of that study unchanged: nonlinguistic and
linguistic stimuli synthesized with different rise times are perceived categorically.
In addition, this note provides correct formulae for predicting
discrimination functions. Several previous sources for the formulae are in
error.

² The trough value T was not stable for individual listeners; we assumed it to
be 0.60 for all listeners for both sets of stimuli represented in Table 1.
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Right-Ear Advantage for Musical Stimuli Differing in Rise Time 
Mark J. Blechner*



ABSTRACT 

Nonspeech stimuli differing in rise time, which resemble the 
sounds of plucked or bowed violin strings, were presented monaurally 
with contralateral noise, and reaction times for stimulus identifica- 
tion were measured. Reaction times were 12.8 msec faster when the 
stimulus was presented to the right ear than to the left ear,
suggesting left-hemisphere involvement in the processing of these stimuli.
This finding, considered along with other studies using the same 
stimuli, suggests that a single psychological mechanism is involved 
in the processing of the plucked and bowed sounds and consonant-vowel 
stimuli. In addition, the data support the theory that the dominant 
cerebral hemisphere is specialized for the processing of temporal 
variation.

The distinction between auditory and phonetic processes in the human per- 
ception of sounds has been a topic of much debate in recent years. Phonetic 
processing implies a mode of perception unique to speech stimuli. It is char- 
acterized by the fact that there is no one-to-one relationship between the acous- 
tic stimulus and percept, and that perception appears to be modulated by rules 
of linguistic rather than acoustic organization (Liberman, Cooper, Shankweiler, 
and Studdert-Kennedy, 1967). 

Wood (1975) listed six experimental operations whose results have been 
thought to converge on the distinction between auditory and phonetic processes.
Three of these characteristic data patterns, however, have been found with a 
particular kind of nonspeech stimulus — sawtooth waves differing in rise time, 
which resemble the sound of a plucked or bowed violin string. The plucked and 
bowed sounds, like consonant-vowel (CV) syllables, show: categorical perception



*Also Yale University, New Haven, Conn. 
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relevant research and a theoretical exposition of this viewpoint. 
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(Cutting and Rosner, 1974), referred to by Wood as the phoneme-boundary effect; 
boundary shifts due to selective adaptation (Cutting, Rosner, and Foard, 1976); 
and asymmetric interference with redundancy gain in a speeded classification 
task (Blechner, Day, and Cutting, 1976). The remaining three experimental re- 
sults cited by Wood are: right-ear advantage for identifying dichotically pre- 
sented speech; right-ear advantage for reporting the temporal order of dichotic 
speech stimuli; and unilateral differences in average evoked potentials during 
the classification of linguistic and nonlinguistic dimensions. All three of 
these appear to reflect a single factor, that is, the lateralization of the 
cerebral hemispheres. It therefore seems quite pressing to determine which, if 
any, hemisphere is predominantly involved in the perception of plucked and bowed 
sounds, but so far data on this issue have been indecisive. Cutting, Rosner, and 
Foard (1975) found that dichotic presentation of the plucked and bowed sounds 
showed no significant ear advantage, but a null result in a dichotic study need 
not be considered conclusive, since it could result from the inadequate sensi- 
tivity and precision of the measure used. 

One way of achieving a decisive finding where null results have predominated
is to use a potentially more sensitive measure, such as reaction time
rather than accuracy. Springer (1973) has developed a means of reflecting 
hemispheric specialization through a reaction-time measure. She presented CV 
syllables monaurally with contralateral white noise and found a 14-msec advan- 
tage for stimuli presented to the right ear. 

The purpose of the present study was to detect a potential ear advantage 
for plucked and bowed sounds using Springer's paradigm, with one modification: 
Springer had subjects respond only with the right hand, raising the possibility 
that the observed ear advantage might have been due to intercallosal transfer 
time rather than to hemispheric processing capacities. In the present study,
therefore, both ear of presentation and hand of response were counterbalanced. 

METHOD 

Stimuli 

The stimuli were identical to those used previously by Blechner et al. 
(1976). They were derived from the sawtooth wave sounds used by Cutting and
Rosner (1974), originally generated on the Moog synthesizer at the Presser 
Electronic Studio at the University of Pennsylvania. The stimuli differed in 
rise time, reaching maximum intensity in either 10 msec (pluck) or 80 msec (bow). 
Using the pulse code modulation (PCM) system at Haskins Laboratories, the
stimuli were truncated to 800 msec in duration and were stored on disc file in 
digitized form. 

The white noise, which was to be presented contralaterally to the stimuli, 
was generated by a General Radio random-noise generator (Model 1390-A) and had 
a bandwidth of 20 kHz. The noise was digitized using the PCM system, truncated 
to a duration of 1000 msec, and then stored on disc file. The absolute levels 
of the noise and target stimuli (pluck and bow), as presented to listeners, were 
80 and 70 dB SPL, respectively. All sounds were reconverted to analog form at 
the time of tape recording. 
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Tapes 



All tapes were prepared using the PCM system. A display tape was prepared 
to introduce the subjects to the stimuli. The two kinds of stimuli (pluck and 
bow) were played in the same order several times, beginning with three tokens 
of each item, then two of each, and finally one of each. 

Two binaural identification tapes were prepared, each with 32 tokens of the 
pluck and bow stimuli (16 of each) in random order. 

Four dichotic test tapes were recorded. On one channel of each test tape, 
60 tokens of the pluck and bow stimuli were recorded in random order with the 
constraint that every 10 stimuli contained equal numbers of pluck and bow stim- 
uli. Thus, long runs of any one kind of stimulus were prevented. Sixty units 
of white noise were recorded on the second channel of the tape, with noise onset 
preceding stimulus onset by 50 msec. In addition, a 50 msec 1000 Hz tone that 
triggered the reaction time counter was recorded on both channels. The onset of
this tone preceded the onset of the noise by 1.55 seconds. An interval of 2
seconds separated the offset of the noise from the onset of the next trigger
tone. The intensity of the trigger tone was equivalent to the maximum intensity 
of the pluck and bow stimuli. 
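
The ordering constraint and trial timing described above can be sketched in a few lines. This is only an illustration, not the procedure actually used to prepare the tapes: it assumes that "every 10 stimuli" means consecutive blocks of ten, and the function names and the seed argument are hypothetical.

```python
import random

def make_test_sequence(n_tokens=60, block=10, seed=None):
    """Random pluck/bow order with equal counts in every consecutive block.

    Assumption: the constraint applies to consecutive blocks of `block` trials.
    """
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_tokens // block):
        chunk = ["pluck"] * (block // 2) + ["bow"] * (block // 2)
        rng.shuffle(chunk)
        sequence.extend(chunk)
    return sequence

def trial_events_msec():
    """Event times in msec, measured from the onset of the trigger tone."""
    noise_on = 1550                    # tone onset precedes noise onset by 1.55 sec
    stimulus_on = noise_on + 50        # noise onset precedes stimulus onset by 50 msec
    stimulus_off = stimulus_on + 800   # stimuli were truncated to 800 msec
    noise_off = noise_on + 1000        # noise was truncated to 1000 msec
    next_tone_on = noise_off + 2000    # 2 sec from noise offset to next trigger tone
    return {
        "noise_on": noise_on,
        "stimulus_on": stimulus_on,
        "stimulus_off": stimulus_off,
        "noise_off": noise_off,
        "next_tone_on": next_tone_on,
    }

sequence = make_test_sequence(seed=1)
events = trial_events_msec()
```

Under these assumptions the stimulus begins 1600 msec after the trigger tone and one full trial cycle lasts 4550 msec.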

Four dichotic practice tapes were also prepared. These were identical in 
design with the test tapes but contained only 20 stimuli each. 

Subjects and Apparatus 

The 16 participants in the experiment included six males and ten females, 
ranging in age from 18 to 22 years. All were strongly right handed, as indi- 
cated by the five most reliable criteria found by Annett (1970). All reported 
no history of hearing trouble.

The tapes were played on an Ampex AG-500 tape recorder, and the stimuli 
were presented through calibrated Telephonics headphones (Model TDH39-300Z).
Subjects sat in a sound-insulated room and responded with their index finger on
either of two telegraph keys mounted on a wooden board. Throughout the
experiment, the left key was used for bow responses, while the right key was used for
pluck responses. The 50-msec pulse preceding each stimulus triggered a 
Hewlett-Packard 522B Electronic Counter. When a response on either telegraph 
key stopped the counter, the reaction time was printed on paper tape by a 
Hewlett-Packard 560A digital recorder for subsequent analysis. The listener's 
response choice was recorded manually by the experimenter. 

Procedure 

Listeners participated individually in a sound-insulated room. At the 
start of each session, they were informed of the general nature of the experi- 
ment and of the particular kinds of sounds that they would be asked to identify. 
They were told that the difference in rise time would be compared to the dif-
ference in sound between a plucked and a bowed violin string. 

For preliminary training, subjects listened to the display sequence. They
were then instructed on the mode of response, after which they listened to the
display sequence twice more, responding to the sounds first with the left hand
and then with the right. Next, they listened to the binaural identification
tapes. Eight of the subjects responded to the first tape with the left hand and 
to the second with the right hand. For the other eight subjects, the order of 
responding hands was reversed. 

Subjects were then told that they would hear the stimulus in one ear, to 
which they were to pay careful attention, while there would be noise in the 
other ear, which they should ignore. They were played a few samples from the 
dichotic tapes to familiarize them with the noise-stimulus combination. They 
then listened and responded to the four practice tapes, and, after a five minute 
rest period, to the four test tapes. 

For each individual listener, the pluck and bow stimuli were always
presented through the same headphone. Ear of presentation was alternated by having
the listener reverse the headset. For eight of the participants the stimulus 
was presented through one of the headphones, while for the other eight it was 
presented through the opposite headphone. 

There were four possible hand-ear configurations. The order of these condi- 
tions was determined by a balanced Latin square design, yielding four possible 
orderings that were administered to four subjects each. The four practice and 
test tapes, however, were always played in the same order, to prevent any possi- 
ble confusion between the effects of the random orders and the hand-ear config- 
urations. 
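
One standard way to build a balanced Latin square for an even number of conditions is sketched below. The text does not say which square was used, so this construction, the condition labels, and the function name are assumptions for illustration only.

```python
def balanced_latin_square(n):
    """Balanced Latin square for even n.

    Each condition appears once in every serial position, and each condition
    immediately follows every other condition exactly once across the rows.
    """
    if n % 2 != 0:
        raise ValueError("this construction requires an even number of conditions")
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)       # next condition counting up
        low += 1
        if len(first) < n:
            first.append(high)  # next condition counting down
            high -= 1
    return [[(x + row) % n for x in first] for row in range(n)]

# Hypothetical labels for the four hand-ear configurations.
CONDITIONS = [
    "left ear / left hand",
    "left ear / right hand",
    "right ear / left hand",
    "right ear / right hand",
]

square = balanced_latin_square(len(CONDITIONS))
orderings = [[CONDITIONS[i] for i in row] for row in square]
```

With 16 listeners, each of the four resulting orderings would be given to four listeners, as described above.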

Subjects were instructed to respond as quickly and accurately as possible. 
In the final data analysis, only the last 50 test trials in each block were
considered, the first ten functioning as warm-up trials to stabilize performance.
The listener, however, was not told that the first ten trials would not count. 

RESULTS 

All of the subjects were able to identify the pluck and bow stimuli accur- 
ately. In the binaural identification trials, no listener made more than 4.7 
percent errors. 

For the reaction time data of the task with contralateral noise, median 
reaction time was calculated for each block of test trials for each subject. An 
analysis of variance was performed on these medians, with order of conditions 
considered as a between-subject factor, and hand and ear of presentation as 
within-subject factors. 
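
The scoring steps described above (drop the ten warm-up trials, take the median of each block, then average the medians across listeners) can be sketched as follows. The reaction times in the example are invented for illustration and are not data from this study; the function names and the dictionary layout are likewise hypothetical.

```python
from statistics import mean, median

WARMUP_TRIALS = 10  # the first ten trials of each 60-trial block were not scored

def block_median(reaction_times_msec):
    """Median reaction time of a block after discarding the warm-up trials."""
    return median(reaction_times_msec[WARMUP_TRIALS:])

def mean_of_medians(block_medians, ear):
    """Mean across listeners of block medians for one ear, pooled over hands.

    `block_medians` is a list with one dict per listener, keyed by (ear, hand).
    """
    per_listener = [
        mean(value for (e, _hand), value in listener.items() if e == ear)
        for listener in block_medians
    ]
    return mean(per_listener)

# Invented example: one block of 60 reaction times, and two listeners' medians.
example_block = list(range(600, 660))
example_medians = [
    {("left", "left"): 680, ("left", "right"): 670,
     ("right", "left"): 660, ("right", "right"): 650},
    {("left", "left"): 700, ("left", "right"): 690,
     ("right", "left"): 685, ("right", "right"): 675},
]
right_ear_advantage = (mean_of_medians(example_medians, "left")
                       - mean_of_medians(example_medians, "right"))
```

A positive value of right_ear_advantage means faster responses to right-ear presentation.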

The mean across subjects of individual medians for right-ear presentation 
of the stimuli was 662.5 msec, while for left-ear presentation, the mean was 
675.3 msec. This 12.8-msec advantage for right-ear presentation was
statistically significant, F(1,12) = 5.69, p < .05. Collapsed over ear of presentation,
mean right-hand response was 665.0 msec, while mean left-hand response was
672.8 msec. This 7.8-msec difference, however, was not statistically reliable.
All other main effect and interaction terms were not significant. 
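
As a rough check on the reported values, the arithmetic can be laid out explicitly. The sketch assumes the usual mixed-design error term for a two-level within-subject factor (listeners within order groups), which is not stated in the text; the variable names are illustrative.

```python
N_LISTENERS = 16
N_ORDER_GROUPS = 4   # four orderings from the Latin square, four listeners each
EAR_LEVELS = 2

# Error degrees of freedom for a two-level within-subject effect when
# listeners are nested in order groups (assumed analysis layout).
error_df = (N_LISTENERS - N_ORDER_GROUPS) * (EAR_LEVELS - 1)

# Differences between the reported means, in msec.
ear_difference = round(675.3 - 662.5, 1)
hand_difference = round(672.8 - 665.0, 1)

# Both pairs of means should average to the same grand mean.
grand_mean_from_ears = round((662.5 + 675.3) / 2, 1)
grand_mean_from_hands = round((665.0 + 672.8) / 2, 1)
```

Under this assumption the error term has 12 degrees of freedom, matching the reported F(1,12), and the ear and hand means imply the same grand mean of 668.9 msec.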

Accuracy in this experiment was quite high. The mean error rate was 0.9
percent. An analysis of variance on the error data showed no significant main
effects or interactions. 
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DISCUSSION 



The Issue of Special Processing for Phonetic Dimensions 

The finding of a significant right-ear advantage for the identification of 
plucked and bowed sounds is very similar to the results for CV syllables, and
suggests left-hemisphere involvement in the processing of both kinds of sounds.
When the present data are considered along with other studies using plucked and 
bowed sounds, the parallels between this kind of nonspeech sound and CV sylla- 
bles are quite compelling. The nonspeech stimuli have yielded all of the basic 
data patterns cited by Wood (1975) as evidence converging on the distinction 
between auditory and phonetic processes. The plucked and bowed sounds, like the
speech stimuli, show asymmetric interference with redundancy gain in the speeded
classification task, categorical boundary effects, selective adaptation of the 
category boundary, and evidence of left-hemisphere specialization. Considered 
together, this constellation of results with nonspeech stimuli leads one to 
question the existence of a special mode of processing for speech stimuli 
(Liberman, 1970), at least on the phonetic level. One might perhaps argue that 
identical results with CV syllables and plucked and bowed sounds do not guaran- 
tee identical perceptual mechanisms. Nevertheless, at the present time, it 
seems most parsimonious to account for these results in terms of a single
mechanism for processing complex auditory dimensions that cue significant
distinctions for a subject, rather than assuming, as Wood (1975) did, that results
reflect separate mechanisms for phonetic and higher level auditory processes. It
should be emphasized, however, that the conclusion proposed here does not
challenge the notion of unique perceptual processes on other levels of linguistic
organization. 

Relevance to Specific Theories of Hemispheric Specialization 

Although the present data have their greatest impact when considered within
a set of converging experimental operations, they are relevant also to the 
specific question of the functions of the two cerebral hemispheres. Kimura 
(1967) suggested that the left and right hemispheres might be specialized, re- 
spectively, for verbal and nonverbal stimuli. This proposition has since been 
questioned by the discovery of right-ear advantages for nonspeech stimuli (for 
example, Halperin, Nachshon, and Carmon, 1973). The present study adds another
set of data that contradicts the verbal-nonverbal dichotomy of hemispheric 
specialization. 

Bever (1975) has suggested an alternative viewpoint, stressing the impor- 
tance of different kinds of processing, rather than intrinsic stimulus variables 
in accounting for lateral asymmetry. He hypothesizes two modes of perception, 
analytic and holistic, for the left and right hemispheres, respectively. This 
view purports to account for individual differences in hemispheric specialization
for melodies as a function of musical ability (Bever and Chiarello, 1974). How- 
ever, the analytic-holistic distinction as currently formulated has little pre- 
dictive value for plucked and bowed sounds. After looking at the data, one 
might suggest that they require analytic processing. After all, the stimuli
differ in rise time, a small difference that might easily be missed if the stimu- 
lus were treated more globally. Other evidence, however, contradicts this view. 
Cutting et al. (1976), for example, have demonstrated that while rise time cues 
the distinction between plucked and bowed sounds, it is not an entirely
sufficient cue. Fully half a second of waveform after stimulus onset is required for
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the items to be identified properly. Thus, the whole of the stimulus is necessary 
for perception, and this fact would suggest a holistic mode of processing. 

It may be more accurate to account for the present data based on the
stimuli's acoustic nature rather than in terms of processing strategy. One
particularly important acoustic property of the plucked and bowed stimuli in terms of
hemispheric specialization is their characteristic rapid acoustic variation. 
Several studies have implicated the resolution of temporal variation as a left- 
hemisphere mechanism, both in audition (Halperin et al., 1973; Cutting, 1974)
and vision (Goldman, Lodge, Hammer, Semmes, and Mishkin, 1968; Carmon and 
Nachshon, 1971). It may well be that the rate of shift in amplitude, which
distinguishes the plucked from the bowed sounds, is responsible for the greater
left-hemisphere involvement. 
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Dichotic Competition of Speech Sounds: The Role of Acoustic Stimulus Structure* 
Bruno H. Repp



ABSTRACT

Dichotic consonant-vowel syllables contrasting in two features
of the initial stop consonant (voicing and place) were presented for
identification in a single-response paradigm without selective
attention instructions. The acoustic structure of the syllables was
varied within categories on both dimensions [voice onset time (VOT)
and formant transitions]. These variations (especially those in VOT)
had a clear influence on the pattern of responses (including blends),
thus ruling out a simple phonetic feature recombination model.
Rather, the auditory properties of the stimuli seem to be preserved 
at the stage of dichotic interaction. An alternative model (the 
"prototype model"), which assumes that dichotic integration of
information takes place at a "multicategorical" stage intermediate
between auditory and phonetic processing, is only moderately supported
by the data. Nevertheless, some arguments are presented for maintain- 
ing this model as a working hypothesis. A new procedure for estimat- 
ing the dichotic ear advantage was applied here for the first time, 
together with the single-response requirement. Most subjects showed 
unusually large right-ear advantages, which makes the present method- 
ology interesting for the study of hemispheric asymmetry. 

INTRODUCTION 

Many recent studies of dichotic listening have employed synthetic syllables 
as stimuli, most often the set /ba/, /da/, /ga/, /pa/, /ta/, /ka/. These
syllables offer a number of advantages over other materials. As synthetic syllables,
their acoustic properties can be precisely controlled. Phonetically, they are a 
homogeneous stimulus set that represents all possible combinations of two values 
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of the voicing feature (voiced, voiceless) and three values of the place feature 
(labial, alveolar, velar). They also yield a reliable right-ear advantage (REA) 
which often tends to be larger than the REA for other classes of competing 
speech sounds (Haggard, 1971; Blumstein, 1974; Cutting, 1974). 

The Feature Recombination Hypothesis 

Detailed studies of the dichotic competition between the six stop conso- 
nants have revealed several interesting phenomena, one of which will be of 
special interest here. When the two competing stimuli differ on both dimensions 
(voicing and place; for example, /ba/-/ta/), many errors are obtained that
combine correct feature values from the two ears, such as /pa/ or /da/ as
responses to /ba/-/ta/. These responses have been termed blend errors (Halwes,
1969; Studdert-Kennedy and Shankweiler, 1970). Blend errors are responsible for
another finding often called the "feature-sharing advantage" (which actually is
a feature-contrast disadvantage): dichotic syllables that differ in both
features receive fewer correct responses than syllables that contrast only in a
single feature (Halwes, 1969; Studdert-Kennedy and Shankweiler, 1970;
Studdert-Kennedy, Shankweiler, and Pisoni, 1972; Pisoni, 1975). These two phenomena
(which are basically the same, since blend errors can occur only with
double-feature contrasts and therefore lead to higher error rates for these dichotic
pairs) have provided the primary support for a feature recombination model of
dichotic interaction. In its simplest form, this model assumes that phonetic 
features are: (1) independently extracted from the auditory information arriv- 
ing from each hemisphere; (2) stored in a common feature buffer where information 
about the origin of the feature values is lost; and (3) finally recombined into 
percepts or responses. In other words, it is assumed that the interaction be- 
tween dichotic stimuli takes place after the extraction of phonetic features, 
and that the competing values of a particular feature have equal probabilities 
of being selected from the feature buffer, independent of other particular 
features. Although this model has not always been clearly stated in the past, 
it was implicit in most previous research on dichotic competition (Halwes, 
1969; Studdert-Kennedy and Shankweiler, 1970; Studdert-Kennedy, Shankweiler,
and Pisoni, 1972; Blumstein, 1974; Pisoni, 1975; Cutting, 1976). 

This simple model makes several strong and easily testable predictions,
some of which have been examined by Halwes (1969). If all information about the 
local origin of the feature values is lost, double-feature contrasts should re- 
ceive an equal number of correct responses and blend errors, and the two possible 
blend (and correct) responses should also be equally frequent. However, Halwes 
found correct responses to be twice as frequent as blend errors. This result 
could be accommodated by assuming that some of the local information is re- 
tained, so that feature values that come from the same hemisphere have a better 
than even chance of being selected together to form a response. However, 
Halwes also found wide variation in the frequencies of blend errors for differ- 
ent individual stimulus combinations, as well as strong asymmetries in the fre- 
quencies of the two possible blend (and correct) responses for individual stimu- 
lus pairs. He suggested that unequal salience of different acoustic cues may 
have played a role, but he did not indicate how this idea could be incorporated 
in the feature recombination model (which he did not explicitly reject). 
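
The predictions of the simple model, and of the variant in which some local information is retained, can be worked out by enumeration. The sketch below is an illustration only: the parameter same_ear (the probability that the place value is drawn from the same ear as the voicing value) is a hypothetical way of expressing retained local information and is not part of the model as published.

```python
from fractions import Fraction
from itertools import product

SYLLABLE = {
    ("voiced", "labial"): "ba", ("voiced", "alveolar"): "da",
    ("voiced", "velar"): "ga", ("voiceless", "labial"): "pa",
    ("voiceless", "alveolar"): "ta", ("voiceless", "velar"): "ka",
}

def recombination_distribution(left, right, same_ear=Fraction(1, 2)):
    """Response probabilities for a dichotic pair of (voicing, place) tuples.

    The voicing value is drawn from either ear with probability 1/2; the place
    value comes from the same ear with probability `same_ear` (an assumption).
    With same_ear = 1/2 the two features are selected independently.
    """
    ears = (left, right)
    distribution = {}
    for voicing_ear, place_ear in product((0, 1), repeat=2):
        response = SYLLABLE[(ears[voicing_ear][0], ears[place_ear][1])]
        p_place = same_ear if place_ear == voicing_ear else 1 - same_ear
        probability = Fraction(1, 2) * p_place
        distribution[response] = distribution.get(response, 0) + probability
    return distribution

ba = ("voiced", "labial")
ta = ("voiceless", "alveolar")
independent = recombination_distribution(ba, ta)
partly_local = recombination_distribution(ba, ta, same_ear=Fraction(2, 3))
```

With independent selection, the two correct responses and the two blends each receive a probability of 1/4. Setting same_ear to 2/3 makes correct responses twice as frequent as blends, the ratio reported by Halwes, but still leaves the two blends equally frequent, so it cannot account for the asymmetries he observed.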



In fact, it is possible to maintain the basic structure of the model, if 
the additional assumption is made that individual phonetic feature values have 
different strengths or saliencies, which are reflected in unequal probabilities 
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of selection from the phonetic feature buffer. The question remains: What 
determines these strengths? One possibility is that they are inherent — that 
they have a phonetic basis. The other possibility, suggested by Halwes (1969),
is that they reflect the acoustic structure of the stimuli. If the latter
hypothesis were true, the simple phonetic feature recombination model would have
to be rejected, since it rests on the basic assumption that dichotic competition 
is exclusively phonetic in nature. 

In order to test these hypotheses, let us consider another prediction of
the model. This prediction is that acoustic stimulus variations within phonetic
categories should not affect the frequency of blend errors and, indeed, should
leave the whole response pattern unchanged. Since the phonetic features are 
assumed to be extracted independently before the combination of information from 
the two hemispheres, acoustic within-category variations can affect only the 
feature extraction process, but not the subsequent recombination of the features. 
By definition, within-category variations do not affect the accuracy of phonetic 
feature extraction (if they do, they are not true within-category variations), 
so that their effect in dichotic competition should be nil. This null
hypothesis, whose maintenance is essential to the survival of the feature recombination
model, was the focus of the present study. A rejection of the hypothesis was 
expected, since an alternative model that predicted specific effects of within- 
category acoustic variations was available. 

The Prototype Model 

This alternative model has been proposed by Repp (1976b, in press). It 
differs from the feature recombination model, as it considers syllables not as 
bundles of separately extracted phonetic features, but as integral multidimen- 
sional entities whose dimensions are inseparable aspects of the whole pattern 
(cf. Lockhead, 1970, 1972; Garner, 1974; see also the present discussion). The
dimensions are assumed to reflect the auditory properties of the stimulus and
thus are continuous, not binary. Rather than being represented as matrices of
discrete feature values, speech sounds are conceptualized as points in a
continuous multidimensional perceptual space. In the same auditory space, a
limited number of fixed "prototypes" are located, which represent the listener's
"ideal" concepts (his tacit knowledge) of the relevant phoneme or syllable
categories. According to this prototype model, a stimulus is identified in three
stages: (1) First, auditory processing leads to a mapping of the acoustic
information into the multidimensional space. (2) In this perceptual space, the
stimulus leads to "activation" of the prototypes in its vicinity, the degree of
activation being an inverse and probably nonlinear function of the (Euclidean)
distance between stimulus and prototype. This results in a "multicategorical
vector" whose elements are the activation values of the prototypes. (3) Finally,
a probabilistic decision process selects the prototype with the largest
activation value as the response (or percept).
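
The three stages can be made concrete with a toy example. Everything numerical here is an assumption for illustration: the two-dimensional space (a voicing axis and a place axis in arbitrary units), the prototype coordinates, the exponential activation function, and the decay constant are not specified by the model. The decision stage is also simplified to a deterministic choice of the largest activation.

```python
import math

# Hypothetical prototype locations: (voicing axis, place axis), arbitrary units.
PROTOTYPES = {
    "B": (0.0, 0.0), "D": (0.0, 1.0), "G": (0.0, 2.0),
    "P": (1.0, 0.0), "T": (1.0, 1.0), "K": (1.0, 2.0),
}

def activation_vector(stimulus, decay=2.0):
    """Stage 2: activation falls off with Euclidean distance to each prototype.

    The exponential form and the decay constant are assumptions; the model
    requires only an inverse, probably nonlinear, function of distance.
    """
    return {
        name: math.exp(-decay * math.dist(stimulus, location))
        for name, location in PROTOTYPES.items()
    }

def identify(stimulus):
    """Stage 3, simplified: choose the prototype with the largest activation."""
    activations = activation_vector(stimulus)
    return max(activations, key=activations.get)

# Stage 1 (auditory mapping) is taken as given: stimuli are already points.
near_b = identify((0.1, 0.0))
near_t = identify((0.9, 1.1))
```

A stimulus lying close to one prototype still activates its neighbors to a lesser degree, which is what the multicategorical vector is meant to capture.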

In the prototype model, dichotic interaction is assumed to take place at 
the level of multicategorical representation, in the form of a weighted averag- 
ing of the multicategorical vectors for the two stimuli. A single categorical 
decision is then made on the basis of this average vector. Thus, the model 
assumes that the competing information is combined and results in a single
percept. This assumption is justified when synthetic syllables with the same
fundamental frequency and in the same vocalic context are used because these
stimuli strongly tend to fuse in dichotic competition (Halwes, 1969; Repp, 1976b;
Repp and Halwes, in preparation). The nature of the single categorical percept
is determined by two factors: ear dominance, represented by the weights in the
averaging process, and stimulus dominance, which is determined by the relative
distances of the two competing stimuli from the prototypes in the perceptual
space. The model predicts that stimuli that are close to a prototype will tend
to dominate stimuli that are far from prototypes; this may be called the
"category goodness hypothesis" of dichotic competition. Category goodness, that
is, the distance from the "correct" prototype, is a function of auditory stimulus
characteristics, so that the model predicts that stimulus dominance will vary if
acoustic within-category variations of the stimuli are introduced. This was
confirmed by Repp (1976b) within a restricted stimulus set, that of the voiced
stop consonants. By varying the initial formant transitions, the dominance
relationships between the stimuli from a "place continuum" could be reliably
influenced, and the pattern of the data conformed at least qualitatively to the
prototype model. 

The present experiment investigated the generality of these earlier find- 
ings. In order to be useful, the prototype model should explain the response 
pattern for all dichotic combinations of the six stop consonants, as well as the 
effects of variations in cues other than the initial formant transitions. Con- 
sider first how the model explains blend responses. Two stimuli such as /ba/ 
and /ta/ will not only activate their correct prototypes (B and T, respectively) 
but also, to a lesser degree, the blend prototypes, D and P, which are neighbors 
in perceptual space. Because of the presumed additivity of prototype activation 
levels, the blend prototypes may reach activation levels comparable to those of 
the correct prototypes, to which only one of the two stimuli makes a substantial 
contribution. 

In principle, this model allows for variations in the frequencies of blends 
between individual stimulus pairs, since they depend in a complex way on the 
arrangement of prototypes and stimuli in the perceptual space. A mathematical 
formulation of the model should be able to predict their pattern. In the pres- 
ent context, however, we will be content with qualitative predictions concerning 
changes in the response pattern, leaving quantitative tests to a future study. 

Contrary to the feature recombination model, the prototype model predicts
variations in the response pattern with changes in the acoustic structure of the 
stimuli. Consider again the previous example, the stimulus pair /ba/-/ta/. 
Assume that we delay the voice onset time (VOT, the important acoustic cue for 
the voicing feature) of /ba/, so that the stimulus is still identified as B, but 
in the perceptual space it is farther removed from the B prototype and closer to
the P prototype. It will now be closer to the boundary between voiced and
voiceless sounds, and it will contribute less activation to B and D and more to 
P and T than the original /ba/. As a result, the frequencies of P and T re- 
sponses should increase, and that of B and D responses should decrease. Similar 
predictions may be made for changes of VOT in the other direction or in the other 
stimulus, or for changes in the formant transitions (the acoustic cue for place 
of articulation) of either stimulus. A number of other, more detailed,
predictions may be derived from the model, some of which will be considered in the
Results section of this paper. 
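
The /ba/-/ta/ example can be traced numerically in a toy version of the model. All coordinates, the exponential activation function, and the equal ear weights are assumptions chosen for illustration; the sketch checks only the direction of change in the combined activations, since predicted response frequencies would also depend on the decision rule.

```python
import math

# Hypothetical prototype locations: (voicing axis, place axis), arbitrary units.
PROTOTYPES = {
    "B": (0.0, 0.0), "D": (0.0, 1.0), "G": (0.0, 2.0),
    "P": (1.0, 0.0), "T": (1.0, 1.0), "K": (1.0, 2.0),
}

def activation_vector(stimulus, decay=2.0):
    """Activation of each prototype; the exponential form is an assumption."""
    return {
        name: math.exp(-decay * math.dist(stimulus, location))
        for name, location in PROTOTYPES.items()
    }

def dichotic_activation(left, right, w_left=0.5, w_right=0.5):
    """Weighted average of the two multicategorical vectors (ear weights assumed)."""
    a_left = activation_vector(left)
    a_right = activation_vector(right)
    return {name: w_left * a_left[name] + w_right * a_right[name]
            for name in PROTOTYPES}

ta = (1.0, 1.0)
original = dichotic_activation((0.0, 0.0), ta)  # /ba/ located at the B prototype
delayed = dichotic_activation((0.4, 0.0), ta)   # /ba/ with longer VOT, still nearest B
```

In this toy space, delaying the VOT of /ba/ lowers the combined activation of B and D and raises that of P and T, which is the direction of change described above.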

The phonetic feature recombination model and the prototype model are not the
only possible conceptions of the process of dichotic interaction, but most other
plausible models are compromises between these two extremes (see the Discussion
section). The detailed formulation of such models seems less important than the
empirical demonstration of within-category effects in dichotic competition; such
a demonstration would rule out a whole class of models. 

In addition to the primary focus on stimulus dominance in dichotic competi- 
tion, the present study gave attention to the factor of ear dominance. A new 
method of calculating ear advantage indices, especially designed for the single- 
response paradigm (Repp, 1976a, 1976b; Repp and Halwes, in preparation) was
applied here for the first time. This experiment constituted part of an ongoing 
series of studies aimed at developing optimal procedures for assessing lateral 
asymmetries in dichotic listening.

METHOD 

Subjects 

The subjects were eight paid volunteers, four women and four men, mostly
Yale students. All had normal hearing, except one man who claimed to have a 
slight (5 dB) hearing loss in the right ear. Two subjects were left-handed, 
one of them only in writing. All were relatively inexperienced listeners. 

Stimuli 

The stimulus set comprised 24 syllables which were synthesized on the 
Haskins Laboratories parallel resonance synthesizer. There were four
acoustically different versions of each of the six syllables, /ba/, /da/, /ga/, /pa/,
/ta/, and /ka/, resulting from all combinations of four different VOTs with six
different (second- and third-) formant transitions, as illustrated in Figure 1.
All syllables were 300 msec long, had no initial bursts, the same transition
durations (50 msec), and the same constant fundamental frequency (90 Hz).

The experimental tape was recorded using the pulse code modulation (PCM) 
system at Haskins Laboratories. The tape contained first a list of 120 single 
syllables consisting of five different random sequences of the 24 stimuli. It 
was followed by two blocks of dichotic pairs. Each block contained 192 pairs,
representing all possible double-feature contrast combinations of the 24
stimuli: six phoneme combinations (/ba/-/ta/, /ba/-/ka/, /da/-/pa/, /da/-/ka/,
/ga/-/pa/, /ga/-/ta/) with two channel/ear assignments for each, and sixteen
different acoustic combinations within each phonemic contrast. Their sequence
was completely random, with interstimulus intervals of 3 seconds. The onsets of
the syllables in a dichotic pair were exactly simultaneous (0.125 msec maximal
error).
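
The composition of one dichotic block can be checked by enumeration. The following is only an illustrative sketch: the labels "low" and "high" for the two transition variants of each syllable, the tuple layout, and the function name are assumptions made for the example; the counts follow the description above.

```python
from itertools import product

# VOT values (msec) of the voiced and voiceless syllables, as described above.
VOICED_VOTS = (0, 15)
VOICELESS_VOTS = (40, 55)
TRANSITIONS = ("low", "high")  # assumed labels for the two F2 onsets per place

# The six double-feature contrasts (voiced syllable, voiceless syllable).
CONTRASTS = [("ba", "ta"), ("ba", "ka"), ("da", "pa"),
             ("da", "ka"), ("ga", "pa"), ("ga", "ta")]


def dichotic_block():
    """Return every (channel 1 stimulus, channel 2 stimulus) pair of one block."""
    pairs = []
    for voiced, voiceless in CONTRASTS:
        voiced_versions = [(voiced, vot, tr)
                           for vot, tr in product(VOICED_VOTS, TRANSITIONS)]
        voiceless_versions = [(voiceless, vot, tr)
                              for vot, tr in product(VOICELESS_VOTS, TRANSITIONS)]
        for a, b in product(voiced_versions, voiceless_versions):
            pairs.append((a, b))  # voiced stimulus on channel 1
            pairs.append((b, a))  # channel assignment reversed
    return pairs


block = dichotic_block()
# 6 contrasts x 16 acoustic combinations x 2 channel assignments
print(len(block))
```

Each contrast contributes 4 x 4 = 16 acoustic combinations because every syllable exists in four versions (two VOTs by two transitions).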

Procedure 

The subjects were tested in small groups in a single session lasting about
two hours. The single-channel series was presented monaurally for
identification, followed by the two dichotic blocks. After a break, the tape
recorder channels were reversed electronically and the two dichotic blocks were
presented again, followed by the monaural syllables, now in the opposite ear.
All in all, each subject listened to ten replications of each single syllable
and to four replications of each dichotic pair (eight, if channel/ear
assignment is ignored). The tape was played back from an Ampex AG-500 tape
recorder through an amplifier/attenuator to Telephonics TDH-39 headphones. The
intensities of the two channels were carefully equalized at about 65 dB SPL
(peak deflections on a voltmeter).

¹It was discovered after the experiment that the first-formant transitions of
the labial consonants were only 40 msec long. However, this was almost
certainly of no consequence.

[Figure 1: grid of the 24 stimuli. Rows are the VOT values 0, +15, +40, and +55
msec; columns are the six transition patterns, with the onset frequencies
listed below. Cells are labeled /ba/, /da/, /ga/, /pa/, /ta/, /ka/.]

TRANSITION ONSETS (Hz)
F2:    846    996   1465   1620   1920   2078
F3:   2180   2525   3195   3195   2525   2180

Figure 1: Acoustic stimulus parameters. The steady-state frequencies for /a/
were 1232 Hz (F2) and 2525 Hz (F3).

As part of the instructions, the subjects were first given a talk on the 
two features — voicing and place — and were told the precise stimulus combinations 
to expect, with the help of a diagram on the answer sheets. However, they were 
not informed about the within-category variations until after the experiment. 
The subjects were asked to write down a single response for each dichotic pair,
whatever the fused stimuli sounded most like. Naturally, the responses were
restricted to the six stop consonants, with the additional admonition to try to
give both voiced and voiceless responses.²

RESULTS AND DISCUSSION 

Monaural Intelligibility 

As is often the case with synthetic syllables, their intelligibility in the 
experiment turned out to be somewhat poorer than anticipated. The confusion 
matrix for all eight subjects is shown in Table 1. The problem lay almost ex- 
clusively with /da/ and /ta/ which were more often heard as /ga/ and /ka/, re- 
spectively. The absence of a burst, which is especially important in alveolar 
consonants, may have been a factor here. The confusability of these stimuli was
not detrimental to the purpose of the experiment, although it had to be dealt 
with in the analysis of the dichotic data. 

Confusions along the voicing dimension were extremely rare and occurred ex- 
clusively at the VOTs closer to the boundary. A similar pattern may be seen for 
/ga/ and /ka/ with respect to place confusions; alveolar responses were more fre- 
quent when the velar transitions were closer to the boundary (low). However, 
for /da/ and /ta/ the opposite was the case; velar responses were more frequent 
when the transitions were farther away from the alveolar-velar boundary (low). 
This curious reversal has been confirmed in other studies using the same stimuli 
(Repp, in preparation); its explanation is far from clear.

²It was thought that some subjects might give predominantly voiceless responses,
which would have reduced the information in the data. This suspicion, derived
from pilot observations, was apparently unfounded. For the same reason, four
subjects (two old and two new) were (re)tested with the same tape with detection
instructions. These instructions restricted the response set to either the
voiced consonants (B, D, G) or the voiceless consonants (P, T, K) only,
counterbalanced across blocks within subjects. Since the subjects knew that each
dichotic pair contained one voiced and one voiceless consonant, this amounted
to a detection task. The main purpose of the detection instructions was to
force the subjects to give an equal number of voiced and voiceless responses to
each pair, and, consequently, only the effects of variations in formant
transitions could be assessed. These effects agreed with those under standard
instructions, as described in the Results section.

TABLE 1: Confusion matrix of the 24 stimuli (monaural identification).

       Stimuli                        Responses
Syllable   VOT    F2          B     D     G     P     T     K

/ba/         0    low        80     -     -     -     -     -
             0    high       80     -     -     -     -     -
           +15    low        76     -     -     4     -     -
           +15    high       79     -     -     1     -     -

/da/         0    low         -    28    52     -     -     -
             0    high        -    34    46     -     -     -
           +15    low         -    27    53     -     -     -
           +15    high        -    39    40     -     -     1

/ga/         0    low         -     5    75     -     -     -
             0    high        -     -    80     -     -     -
           +15    low         -    16    64     -     -     -
           +15    high        -     -    79     -     -     1

/pa/       +40    low         -     -     -    80     -     -
           +40    high        1     -     -    78     -     1
           +55    low         -     -     -    80     -     -
           +55    high        -     -     -    80     -     -

/ta/       +40    low         -     -     -     1    22    57
           +40    high        -     1     -     1    34    44
           +55    low         -     -     -     3    30    47
           +55    high        -     -     -     -    61    19

/ka/       +40    low         -     -     4     -    12    64
           +40    high        -     1     2     -     4    73
           +55    low         -     -     -     2     6    72
           +55    high        -     -     -     2     1    77


The Dichotic Response Pattern 

The dichotic response pattern for the six phonemic contrasts, disregarding
within-category variations, is shown in Table 2. The underlined percentages
represent blends; their total frequencies are given in the last column. It can
be seen that blend responses were extremely common but varied in frequency as a
function of the stimuli involved: in the two pairs containing /ba/, blend
responses comprised almost two-thirds of all responses; in the two pairs
containing /pa/, only about one-third; in the remaining two pairs, somewhat
less than half. In these two last pairs (alveolar-velar contrasts), the exact
proportion of blends was uncertain, as indicated by the parentheses in Table 2.
Because of the listeners' uncertainty about the place of articulation of the
component stimuli, blend responses could have arisen from either blending or
from confusions and, likewise, "correct" responses may have included some true
blends.

TABLE 2: Dichotic stimulus-response matrix.

                              Percentage of responses
Stimuli         B        D        G        P        T        K     Correct  Blends

/ba/-/ta/     11.4    _3.6  +   3.4_    _56.7_    14.2  +  10.6      36.2    63.8
/ba/-/ka/     13.5    _3.5  +   6.8_    _56.2_     3.2  +  16.6      33.5    66.5
/da/-/pa/     _4.8_   24.6  +  20.0      23.6    _12.2  +  14.7_     68.2    31.8
/da/-/ka/      1.2 +  16.5     _37.9_    _1.7 +    8.2_     34.6    (52.3)  (47.7)
/ga/-/pa/     _7.1_    8.1  +  38.4      24.6     _4.3  +  17.5_     71.1    28.9
/ga/-/ta/     _0.8 +  10.9_     37.5      1.3 +   17.3     _32.2_   (56.1)  (43.9)



The poor discrimination between alveolar and velar place is also reflected
in the responses to the other pairs, each of which contained one labial
consonant. Since the labials were highly intelligible (Table 1), alveolar and
velar responses were simply grouped together in these dichotic pairs, as
indicated by the plus signs in Table 2. For example, G responses to /ba/-/ta/
were considered blends, while K responses were considered correct. In
alveolar-velar pairs, the few labial responses that occurred (probably random
errors) were combined with the alveolar responses. These groupings were
maintained in all further data analyses.

Table 2 shows enormous variation in the pattern of blend responses. In the
two pairs containing /ba/, P responses predominated and were more than twice as
frequent as P responses to pairs actually containing /pa/. In terms of the
prototype model, this indicates that /ba/ was far from the B prototype on the
voicing dimension but close to it on the place dimension, that is, it was weak
on the former but strong on the latter; hence the joint predominance of labial
and voiceless responses. This suggests that the response pattern could perhaps
be explained in terms of separate and independent competition of the two
features, voicing and place, although this would contradict the prototype
model. However, in the two pairs containing /pa/, for example, correct
responses were much more frequent than predicted by this hypothesis, while in
pairs containing /ba/, they were less frequent than predicted. Note that the
hypothesis of feature independence predicts that responses in the different
place categories should be proportional within voicing categories. However, the
stimulus pair /ga/-/pa/, for example, received five times as many G responses
as B responses, but actually fewer K than P responses. This result contradicts
the hypothesis of feature independence in dichotic competition. In principle,
this is compatible with the prototype model, although it is not yet clear
whether a more rigorous, quantitative formulation of the model would be able to
explain the detailed response pattern. The feature recombination model, on the
other hand, cannot explain the variations in the proportions of blend responses
for different stimulus pairs or the asymmetries in blend responses to
individual pairs, thus confirming Halwes (1969).
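
The independence argument can be made concrete with a small calculation. The sketch below is one possible reading of the hypothesis, not the analysis actually used: it assumes that, under feature independence, the probability of a response is the product of its voicing and place marginals, and it groups alveolar and velar responses as in Table 2. The function and variable names are illustrative.

```python
# Response percentages from Table 2 (columns B, D, G, P, T, K).
TABLE_2 = {
    "ba-ta": {"B": 11.4, "D": 3.6, "G": 3.4, "P": 56.7, "T": 14.2, "K": 10.6},
    "ga-pa": {"B": 7.1, "D": 8.1, "G": 38.4, "P": 24.6, "T": 4.3, "K": 17.5},
}


def correct_under_independence(resp, voiced_is_labial):
    """Return (predicted, observed) percentage of correct responses.

    The prediction multiplies the voicing and place marginals, which is
    what independent competition on the two features would imply.
    """
    total = sum(resp.values())
    p_voiced = (resp["B"] + resp["D"] + resp["G"]) / total
    p_labial = (resp["B"] + resp["P"]) / total
    # Probability that a response carries the place of the voiced stimulus.
    p_place_voiced = p_labial if voiced_is_labial else 1.0 - p_labial
    predicted = (p_voiced * p_place_voiced
                 + (1.0 - p_voiced) * (1.0 - p_place_voiced))
    if voiced_is_labial:
        correct_keys = ("B", "T", "K")   # e.g. /ba/-/ta/
    else:
        correct_keys = ("D", "G", "P")   # e.g. /ga/-/pa/
    observed = sum(resp[k] for k in correct_keys) / total
    return 100.0 * predicted, 100.0 * observed


pred_ga_pa, obs_ga_pa = correct_under_independence(TABLE_2["ga-pa"], False)
pred_ba_ta, obs_ba_ta = correct_under_independence(TABLE_2["ba-ta"], True)
print(round(pred_ga_pa, 1), round(obs_ga_pa, 1))
print(round(pred_ba_ta, 1), round(obs_ba_ta, 1))
```

Under these assumptions the predicted proportion of correct responses for /ga/-/pa/ is about 51 percent against 71 percent observed, whereas for /ba/-/ta/ it is about 39 percent against 36 percent observed, in line with the direction of the deviations described above.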

Effect of Within-Category Variations in VOT 

These results are shown in Table 3. The data are shown as the percentages
of voiced and voiceless responses, and of correct and blend responses to the
four VOT combinations, averaged over the different phonemic contrasts and the
variations in formant transitions.

TABLE 3: Percentages of voiced and voiceless correct responses and blends as a
function of VOT combinations.

                                    VOT of voiceless stimulus
                 VOT of           Correct         Blends          Total
Responses    voiced stimulus    +40    +55      +40    +55      +40    +55

Voiced              0           42.8   15.1     19.1    5.7     61.9   20.8
                  +15           35.1   20.6     18.5   10.4     53.6   31.0

Voiceless           0           16.3   33.1     21.8   46.1     38.1   79.2
                  +15           19.5   28.5     26.9   40.5     46.4   69.0

Total               0           59.1   48.2     40.9   51.8
                  +15           54.6   49.1     45.4   50.9





Obviously, the variations in VOT had a strong effect on the response
pattern. The most striking effect was produced by a change in the VOT of the
voiceless stimulus. Voiceless stimuli with the shorter VOT (+40) led to a
slight predominance of voiced responses, while those with the longer VOT (+55)
brought about a predominance of voiceless responses. This is in agreement with
the prototype model, since there is good reason to assume that a voiceless
stimulus with a VOT of +55 will be closer to its prototype than a stimulus with
a VOT of +40. On the other hand, the effect of a change in the VOT of voiced
stimuli was less striking and showed an interaction with the VOT of the
voiceless competitor. When the VOT of the latter was +40, the effect of a VOT
change from 0 to +15 in the voiced stimulus was as predicted, that is, it led
to a relative decrease in the percentage of voiced responses. However, when the
VOT of the voiceless stimulus was +55, the same change in the VOT of the voiced
stimulus had just the opposite effect. This interaction was unexpected and
is difficult to explain.
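
The size of this interaction can be read directly from the total percentages of voiced responses in Table 3. The following sketch simply forms the relevant differences; the dictionary layout and names are illustrative and not part of the original analysis.

```python
# Total percentage of voiced responses from Table 3, indexed by
# (VOT of voiced stimulus, VOT of voiceless stimulus), both in msec.
VOICED_RESPONSES = {(0, 40): 61.9, (15, 40): 53.6,
                    (0, 55): 20.8, (15, 55): 31.0}


def effect_of_voiced_vot(voiceless_vot):
    """Change in voiced responses when the voiced VOT goes from 0 to +15."""
    return (VOICED_RESPONSES[(15, voiceless_vot)]
            - VOICED_RESPONSES[(0, voiceless_vot)])


effect_at_40 = effect_of_voiced_vot(40)      # predicted direction (decrease)
effect_at_55 = effect_of_voiced_vot(55)      # opposite direction (increase)
interaction = effect_at_55 - effect_at_40    # difference of the two effects
main_effect_voiced = (effect_at_40 + effect_at_55) / 2.0
main_effect_voiceless = (
    (VOICED_RESPONSES[(0, 55)] + VOICED_RESPONSES[(15, 55)]) / 2.0
    - (VOICED_RESPONSES[(0, 40)] + VOICED_RESPONSES[(15, 40)]) / 2.0
)
print(round(effect_at_40, 1), round(effect_at_55, 1), round(interaction, 1))
print(round(main_effect_voiced, 2), round(main_effect_voiceless, 2))
```

The two simple effects have opposite signs and nearly cancel, so the average effect of the voiced VOT is close to zero, while the effect of the voiceless VOT is large.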

This pattern of results was highly consistent between individual phoneme
combinations and individual subjects. Analysis of variance of the percentages
of voiced (voiceless) responses yielded a highly significant effect of the VOT
of the voiceless stimulus (F(1,7) = 59.14, p < .0002) and a significant
interaction between the VOT of the voiced stimulus and the VOT of the voiceless
stimulus (F(1,7) = 24.63, p < .002). The main effect of the VOT of the voiced
stimulus was not significant.

Table 3 also shows that the proportion of correct responses and blends
varied as a function of VOT. Correct responses were more frequent where voiced
responses were more frequent, while blends tended to accompany voiceless
responses. Note that the majority of all voiced responses were correct, while,
among the voiceless responses, blends were more frequent than correct responses.
This indicates that the place feature of voiceless stimuli was weak in
competition with the place feature of voiced stimuli. In terms of the prototype
model, it suggests that noise-excited formant transitions are a less effective
cue to place of articulation than voiced transitions. This is plausible since
the present stimuli did not contain any bursts, a second important cue to place
of articulation that certainly is more important in voiceless plosives.

Effect of Within-Category Variations in Formant Transitions

These results are shown in Table 4 as the percentages of responses with the
place of the voiced stimulus and with the place of the voiceless stimulus, and
of correct responses and blends. The dimensions of each 2x2 subtable are the
transitions of the voiceless stimulus (rows) and of the voiced stimulus
(columns). The transitions were classified according to whether they were close
to or far from the category boundary separating the place values of the two
competing stimuli. Thus, "close" refers to the higher F2 transitions for
labials and for alveolars paired with velars, but to the lower F2 transitions
for velars and for alveolars paired with labials.

TABLE 4: Percentages of correct responses and blends with the place of the
voiced (voiceless) stimulus as a function of transition combinations.

                                        Voiced transitions
Voiceless                     Correct          Blends           Total
transitions                 close   far      close   far      close   far

Responses with place of voiceless stimulus
  close                     ??.9    25.1     29.1    33.0     61.0    58.1
  far                       32.0    26.2     36.0    36.8     68.0    63.0

Responses with place of voiced stimulus
  close                     23.4    28.3     15.6    13.6     39.0    41.9
  far                       20.1    25.7     11.9    11.3     32.0    37.0

Total
  close                     54.3    52.6     45.7    47.4
  far                       52.1    51.9     47.9    48.1

It is evident that the effect of variations in the formant transitions was 
much smaller than that of VOT, but it was in the direction predicted by the 
prototype model: responses with the place of the voiced stimulus were most fre- 
quent when the transitions of the voiced stimulus were far and those of the 
voiceless stimulus were close, and they were least frequent when the opposite 
was the case.³ This pattern was shown primarily by the correct responses; the
blends followed a somewhat different pattern, tending to be least frequent when 
both stimuli were close and most frequent when both were far. 



³It may be argued that the within-category effect of the transitions reflected
merely changes in the confusion probabilities of alveolar and velar stimuli
(cf. Table 1). However, the dichotic effects were only slightly reduced after
a correction was applied that took changes in confusion structure into account.
Moreover, the transitions of labial consonants (which were rarely confused; see
Table 1) had a very pronounced effect.

Analysis of variance of the responses with the place of the voiced (voiceless)
stimulus yielded a highly significant effect of the transitions of the
voiced stimulus (F(1,7) = 27.17, p < .002), but only a marginally significant
effect of the transitions of the voiceless stimulus (F(1,7) = 4.79, p < .07), with
no significant interaction between the two. Thus, the former was more reliable 
than the latter, which again indicates that the transitions of voiceless stimuli 
were weak in their perceptual effect. 

There were some consistent deviations from the pattern in Table 4, which 
are in part responsible for the relatively small average effect. Labial-velar
pairs, especially /pa/-/ga/, received more labial responses when the velar 
transitions were far than when they were close. Pairs containing alveolar con- 
sonants, on the other hand, conformed to the predictions, despite the inverted 
pattern of place confusions in monaural presentation (see Table 1). 

Within-Category Feature Interactions

It has been pointed out above that the response pattern in Table 2 cannot
be explained by independent competition on the two phonetic dimensions (phonetic 
feature independence). The question of feature independence may also be asked 
within phonemic combinations (auditory feature independence) : Did within-cate- 
gory variations in VOT affect competition on the place dimension, and did within- 
category variations in the formant transitions influence competition on the 
voicing dimension? 

Responses with the place of the voiced (voiceless) stimulus did not vary 
significantly as a function of VOT. However, a more detailed analysis showed
that the VOT of the voiceless stimulus did have a significant influence in some 
individual stimulus combinations. The largest of these effects was in /ba/-/ka/ 
and consisted in a decrease in labial responses and an increase in velar re- 
sponses as the VOT of /ka/ changed from +40 to +55. This effect is in agreement 
with the prototype model which predicts a certain amount of positive correlation 
between features: as a stimulus moves closer to its prototype along one dimen- 
sion, its overall Euclidean distance from the prototype is reduced, and other 
dimensions will indirectly benefit from this increase in category goodness. 
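
The geometric point can be illustrated with a toy calculation. Everything in the sketch below is hypothetical: the two-dimensional space, the coordinates, and the names are chosen only to show that a shift along the voicing axis alone reduces the overall Euclidean distance to the prototype.

```python
import math

# Hypothetical coordinates (voicing axis, place axis) in arbitrary units.
K_PROTOTYPE = (1.0, 1.0)
KA_VOT_40 = (0.4, 0.7)   # assumed position of /ka/ with VOT = +40
KA_VOT_55 = (0.7, 0.7)   # same place coordinate, closer on the voicing axis

d_40 = math.dist(KA_VOT_40, K_PROTOTYPE)
d_55 = math.dist(KA_VOT_55, K_PROTOTYPE)
print(round(d_40, 3), round(d_55, 3))
```

Although the place coordinate is unchanged, the stimulus with the longer VOT lies closer to the K prototype overall, which is the sense in which the place dimension could "benefit" from a change in VOT.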

Voiced (voiceless) responses showed a significant effect of the transitions 
of the voiceless stimulus (F(1,7) = 22.61, p < .003). Voiced responses were more
frequent when the voiceless transitions were closer to the boundary, which is 
again in agreement with the prototype model. The (nonsignificant) effect of the 
transitions of the voiced stimulus, however, was not in the predicted direction. 
It was also surprising that the voiceless transitions affected competition on 
the voicing feature more than competition on the place feature. 

The prototype model also predicted variations in the proportion of blend
errors (and correct responses) as a function of joint variation in both stimulus
dimensions. Correct responses were expected to be most frequent (and blend
responses least frequent) when the two competing stimuli were farthest apart in
perceptual space, that is, when they were closest to their respective correct
prototypes. The opposite result was predicted when the two stimuli were closest
in perceptual space, and thus almost as close to the blend prototypes as to the
correct prototypes. This hypothesis was most easily tested by considering only
the acoustically most similar and the acoustically most dissimilar pair within
each phonemic contrast. (For example, in /ba/-/ta/, the most similar pair would
be /ba/ with high F2 transitions and VOT = +15 paired with /ta/ with low F2
transitions and VOT = +40, while the most dissimilar pair would be /ba/ with
low F2 transitions and VOT = 0 paired with /ta/ with high F2 transitions and
VOT = +55.) Of the six phonemic contrasts, only one supported the prediction,
while four showed differences in the opposite direction. Overall, blends were
more frequent when the competing stimuli were acoustically dissimilar. This is
in contradiction to the prototype model. However, the result is in agreement
with, and indeed a consequence of, the earlier observations that variations in
the formant transitions had a relatively small effect, and that blends tended
to accompany voiceless responses, which increased greatly in frequency as VOT
changed from +40 to +55.

Ear Dominance⁴



The present experiment offered a first opportunity to apply an improved
method for calculating an unbiased index of ear dominance recently proposed by
Repp (1976a, 1976b). This new index takes into account the variations in
stimulus dominance by applying the methods of signal detection theory and
fitting a receiver-operating-characteristic (ROC) curve to the data points for
individual stimulus pairs. The index is a linear transformation of the area
under the ROC function (cf. Green and Swets, 1966), and it ranges from +1 for a
perfect REA to -1 for a perfect left-ear advantage. Its derivation and its
advantages over other indices are discussed in a separate paper (Repp and
Halwes, in preparation).
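
A rough sketch of an index of this kind is given below. It is not the published procedure: the text states only that the index is a linear transformation of the ROC area with range -1 to +1, so the mapping 2A - 1, the trapezoidal estimate of the area (instead of a fitted curve), the choice of axes, and the data points are all assumptions made for illustration.

```python
def roc_area(points):
    """Trapezoidal area under an ROC curve through (x, y) points in [0, 1].

    Assumed axes: x = proportion of a given response when that stimulus is
    in the left ear, y = the same proportion when it is in the right ear.
    The end points (0, 0) and (1, 1) are always included.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def ear_index(points):
    """Map the ROC area A onto [-1, +1] via 2A - 1 (assumed transformation)."""
    return 2.0 * roc_area(points) - 1.0


# Hypothetical data points, one per stimulus pair.
example_points = [(0.1, 0.6), (0.3, 0.85), (0.6, 0.95)]
print(round(ear_index(example_points), 2))
```

With this mapping, points on the diagonal (no ear effect) give an index of 0, a curve through the upper left corner gives +1, and a curve through the lower right corner gives -1. Differences in stimulus dominance move a point along the curve without changing the area, which is why such an index can be described as unbiased.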

The calculation of the unbiased ear advantage index presupposes that the
responses can be grouped into two exhaustive categories. Double-feature
contrasts present a problem here, because of the large proportion of blend
errors which are ambiguous with respect to ear dominance. At present, it is not
clear how a valid index could be derived from the responses at the phonemic
level. However, the problem can be circumvented by separately considering the
two features, voicing and place. Ear dominance indices for voicing only are
easily calculated by classifying the responses as voiced and voiceless,
ignoring the place feature. These indices (and the corresponding ROC function)
were based on 24 data points, representing the four VOT combinations for each
of the six phonemic contrasts, ignoring variation in the transitions. The
results are shown in the first column of Table 5.

Similar indices were calculated for the place dimension by dichotomizing
the responses, using the same grouping of place categories as in the earlier
data analysis. Each index was based on 24 data points, representing the four
transition combinations for each of the six phonemic contrasts, ignoring
variations in VOT. These indices are shown in the second column of Table 5. The
third column of Table 5 shows the same indices, but omitting the eight data
points for alveolar-velar contrasts.

Table 5 shows that there was a highly significant average REA. Except for
one subject on the voicing dimension, all subjects showed REAs. The most
striking result is the magnitude of these effects. The average REAs, as well as
most of the individual coefficients, are several magnitudes larger than the
advantages reported in earlier studies of normal subjects. (In fact, several
subjects show REAs close to the possible maximum.) There are two possible
reasons why these indices are so large. One is that some conventional indices,
such as the Phi coefficient (Kuhn, 1973; Repp, 1976b), underestimate the "true"
size of the ear advantage. For example, the average Phi coefficient on the
voicing dimension was +0.30, which is only about half the size of the unbiased
index of +0.55. However, this Phi coefficient is still very large compared to
those in earlier studies, which required the subjects to give two responses
[for example, Shankweiler and Studdert-Kennedy (1975), who reported an average
Phi of +0.06]. The reason for this difference may be that the single-response
paradigm adopted here eliminates much of the noise that is present in
two-response data and therefore reveals the true magnitude of the ear
advantage. There is much to be said in favor of this argument (see Repp and
Halwes, in preparation). However, Repp (1976b) reported an average Phi
coefficient of only +0.06 in a single-response experiment with completely fused
syllables that contrasted in place only. Clearly, there must be an additional
factor beyond the response requirements and the kind of index used. Although
previous studies have not indicated a substantial difference in the REA for
completely fused and partially fused syllables, the present results suggest
strongly that such a difference exists; it perhaps was obscured by guessing
responses in earlier studies requiring two responses.⁵

TABLE 5: Individual ear advantages [unbiased coefficients based on the method
described in Repp (1976a, 1976b) and Repp and Halwes (in preparation)].

Subjects      Voicing       Place       Place(a)

JK              ?             ?           ?
JL            +0.73         +0.52       +0.64
?               ?           +0.57       +0.76
MR            +0.57         +0.82       +0.89
?             -0.09(c)      +0.35       +0.35
?             +0.90         +0.76       +0.78
?             +0.47         +0.14       +0.26
?             +0.75         +0.81       +0.98

Average       +0.55         +0.51       +0.60

BHR(e)        +0.96         +0.55       +0.64

(a) Omitting alveolar-velar contrasts.
(b) Claimed a 5-dB hearing loss in the right ear.
(c) Not significant. All other coefficients are significant at p < .05 or
    better [estimated according to the procedure outlined in Repp and Halwes
    (1976)].
(d) Left-handed (WT for writing only).
(e) Data for the author as a subject; average of three sessions.

⁴The terms, ear dominance and ear advantage, are used interchangeably here,
although the former is more appropriate within the single-response paradigm.



⁵It may be noted that none of the four subjects (JK, JL, and two new listeners)
who received detection instructions (footnote 2) showed a large REA on the place
dimension, and JL showed a marked reduction in her REA. The coefficients for
these subjects were +0.20, +0.08, and +0.12, respectively (alveolar-velar pairs
included).

A comparison between the second and third columns in Table 5 shows that,
for all subjects but one, exclusion of alveolar-velar pairs led to an increase
in the ear dominance coefficient on the place dimension. This finding
illustrates an important methodological point: pairs of stimuli that are highly
confusable will tend to show a reduced ear advantage. It follows that high
intelligibility of the stimuli in a dichotic test is an important requirement,
and that pairs of confusable stimuli should be omitted from consideration when
the ear advantage is determined.

Finally, the indices for voicing and place (columns 1 and 3 in Table 5) may
be compared. While the average indices are similar, there are substantial
individual differences. Some of these may be due to chance, but the larger
differences (and especially that for BHR, the author, whose results are based
on 2,304 responses) are certainly real. It must be concluded that, for a given
individual, the REA on the voicing dimension is not necessarily the same as on
the place dimension. Underlying these differences may be individual differences
in the perceptual representation of the speech sounds and of their dimensions
(for example, in the structure of the subjective perceptual space). This points
to a substantial problem in measuring the "true" or "physiological" ear
advantage, which we are only now beginning to understand. Future research will
have to deal with the possibility of interactions between hemispheric dominance
and perceptual organization in individuals.

GENERAL DISCUSSION

The present study demonstrates clear effects of within-category acoustic
variations on dichotic stimulus dominance relationships. This finding
constitutes conclusive evidence against a simple phonetic feature recombination
model, as outlined in the Introduction. It also renders insufficient a more
elaborate version of this model incorporating the concept of inherent phonetic
feature strength. Rather, the competitive strengths of phonetic feature values
are probably a direct function of the acoustic stimulus structure, and changes
in the latter lead to changes in the former. Thus, dichotic interaction does
not take place at a strictly phonetic level, but at an earlier stage where
auditory information is still preserved in some form.

The prototype model provides one possible conception of this auditory
representation. According to this model, the dichotic inputs converge in the
form of multicategorical vectors, a stage intermediate between continuous
auditory and discrete phonetic representation. The multicategorical stage
embodies the relationship between the variable auditory input and the more or
less fixed phonetic categories. It has proven useful in conceptualizing the
process of dichotic interaction and fusion (Repp, 1976b, in press) which so far
has been considered only in terms of the auditory-phonetic dichotomy
(Studdert-Kennedy, Shankweiler, and Pisoni, 1972; Pisoni, 1975; Cutting, 1976;
Studdert-Kennedy, in press). However, the prototype model was only moderately
supported by the present data. Below, we will briefly summarize some of its
shortcomings, consider some alternative models, and present some theoretical
arguments in favor of maintaining the prototype model as a working hypothesis.

On the whole, the main prediction of the prototype model was confirmed: a
dichotic stimulus tends to gain in competitive strength if its acoustic struc- 
ture is changed so that it moves closer to its presumed correct prototype and 
away from category boundaries. However, there were two major exceptions: the 
inverted effect of a change in VOT from 0 to +15 when the competing stimulus had 
a VOT of +55 (Table 3), and the inverted effect of a change in the transitions
of velars when paired with labials (mentioned in connection with Table 4). Both 
effects are very difficult to rationalize, but there is no doubt about their 
reality. A follow-up study of dichotic competition along the VOT dimension has 
revealed even more bizarre interactions. Note that they cannot be explained by 
atypical stimulus characteristics (such as synthesis artifacts) or by different 
assumptions about the location of the prototypes in perceptual space. For ex- 
ample, it has been implicitly assumed that VOT = 0 is closer to the voiced pro- 
totype than VOT = +15, and that VOT = +55 is closer to the voiceless prototype 
than VOT = +40. However, if the obvious hypothesis is introduced that the pro- 
totypes represent the modal production values of the corresponding articulatory 
dimensions, the first part of the assumption is probably false: VOT = +15 is 
closer to the modal production value than VOT = 0, at least for alveolars and 
velars (Lisker and Abramson, 1964; Klatt, 1973; Zlatin, 1974). However, even if 
this were true — and the data permit this interpretation as well as the opposite — 
it could not explain the interaction obtained; all that would change is the part 
of the interaction which is considered anomalous. (Note also that the VOT inter- 
action was exhibited by all six phonemic combinations and thus was apparently 
independent of place of articulation.) 

There is little value in discussing the several other respects in which the 
prototype model has failed. Instead, it seems useful to consider alternative 
models that perhaps could account for the anomalous findings. Unfortunately, 
however, the most obvious candidates make rather similar predictions and do not 
fare better than the prototype model. 

It is possible, for example, to consider a pure "auditory averaging model."
This model would assume that the dichotic stimuli are integrated at a strictly
auditory level of processing, so that a single stimulus, a kind of auditory
average of the two components, is phonetically interpreted. In the present
context, this model makes predictions that are quite similar to those of the
prototype model, but in other contexts differential predictions can be generated and
the auditory averaging model has been found insufficient (Cutting, 1976; Repp, 
1976b, in press). It is quite possible, however, that some auditory interaction 
is involved in addition to integration at a higher, multicategorical (and, per- 
haps, even phonetic) level. Such a multilevel model of dichotic interaction 
would be of considerable complexity, but it is not clear whether it could explain 
the anomalies in the present data. 

Another alternative model that deserves some discussion is the "feature 
detector model" which currently enjoys some popularity (Eimas and Corbit, 1973; 
Cooper, 1974; Cooper and Nager, 1975; Miller, 1975, 1976; Studdert-Kennedy, in
press). This model assumes a separate set of detectors for each feature, with
one detector corresponding to each value of a feature (Eimas and Corbit, 1973;
Cooper, 1974; Miller, 1975). Effectively, this places the prototypes at the
level of auditory analysis. Dichotic interaction may be conceptualized as
follows: each stimulus passes through separate banks of feature detectors and
emerges as an array of multicategorical feature codes (that is, as a
multicategorical matrix). These matrices then converge upon a single processor
where they are averaged. Subsequently, separate feature decision mechanisms
select the largest detector response for each feature, and finally these
categorical feature values are combined into a percept or response. Thus, each
feature or dimension has its own little perceptual space and its own set of
prototypes.

The predictions of the feature detector model are again rather similar to
those of the prototype model, except that, in its simplest form, the former
assumes mutual independence of individual features. There are several instances 
in the present data where this assumption must be rejected, so that rather com- 
plex ad hoc assumptions about the interrelations among feature detectors and 
among feature decisions would have to be introduced. The prototype model, on 
the other hand, predicts specific interdependencies between different features; 
some of them were supported by the data but others were not. The data therefore 
do not permit a choice between these alternative models. However, given that 
they are equally well (or equally poorly) supported, there are some theoretical 
reasons why the prototype model might be preferred as a working hypothesis.

The voicing and place features of stop consonants are among the best exam- 
ples of "integral" dimensions (Lockhead, 1972; Garner, 1974). One cannot exist 
without the other, and selective attention to one feature is impossible without 
taking the other feature into account. In fact, there is strong evidence that 
the whole CV syllable is an integral unit of processing (Pisoni and Tash, 1974;
Wood and Day, 1975). Integral units are multidimensional, and their dimensions
interact during processing. The feature detector model can deal with such inter- 
actions only by some rather strenuous assumptions which, typically, are made 
post hoc and often are based on assumptions of serial processing, which are in- 
appropriate with integral dimensions (Garner, 1974). The prototype model, by 
virtue of its multidimensional Euclidean structure, naturally incorporates such 
interactions, and it makes predictions that can be quantified and falsified. 
Moreover, it is somewhat counterintuitive and uneconomical to assume a separate 
categorical decision for each feature, subconscious as these decisions may be. 
A single phonetic decision is more in line with subjective experience and cer- 
tainly more parsimonious. 

Lockhead (1972) has discussed similar problems with respect to visual
stimuli. His views are worth quoting here, since they apply to speech stimuli 
as well. 

A distinctive feature must be a set of attributes considered in rela- 
tion to all stimuli; one cannot have distinctive features in a 
vacuum.... We must determine the space, the set of relations, and 
not just the features, if we are to understand pattern recognition. 
The basic hypothesis is that observers first locate an object in some 
complex psychological space and then analyze that locus according to 
the needs of the task.... Perhaps a distinctive feature can be defined
as an attribute(s), or the value of an attribute(s), of a stimulus
which causes that integral object to be distant from other potential
stimuli in the psychological space.... [This] directs attention
to the possibility that the relations between attributes (which
is another way of saying locus in space) may be processed before the
values of the attributes themselves are processed. (Lockhead, 1972:
417-418, his emphasis.)

The prototype model is very much in line with Lockhead's views. It adds the
assumption of category prototypes, a concept that has been useful in various
other areas of perception (e.g., Posner, 1969; Reed, 1972; Rosch, 1973; Smith,
Shoben, and Rips, 1974; Hyman and Frost, 1975) but has been neglected in models 
of speech perception [except perhaps for the work of the Leningrad group; see 
Galunov and Chistovich (1966) and Galunov (1968)]. Thus, the prototype model 
has considerable heuristic value, and much more evidence will have to be col- 
lected before it can be confidently rejected. The achievement of the present 
study lies primarily in the rejection of the overly simple phonetic feature re- 
combination model; its contribution to the evaluation of the prototype model re- 
mains modest. 

The second important result of the present study is the magnitude of the 
ear advantages obtained. It suggests that the single-response paradigm,
together with the unbiased ear dominance index (Repp, 1976a, 1976b; Repp and Halwes,
in preparation) is a powerful method for assessing laterality effects, and that 
it is probably one step closer towards an optimal dichotic test for diagnostic 
purposes.

Distance Measures for Speech Recognition: Psychological and Instrumental*
Paul Mermelstein



ABSTRACT 

Perceptual confusion among speech sounds can serve as a guide
to the selection of appropriate distance metrics for verification of
hypotheses in speech-recognition systems. Known results covering
psychological representation of speech sounds are first reviewed.
Desirable properties for distance measures for verification are
stated, and previously proposed distance metrics for word recognition
are evaluated in this light. This paper reports on one experiment
that demonstrates the need for assessing the significance of local
differences by any distance metric to be used for verification of
syllable-sized hypotheses concerning the speech signal.

INTRODUCTION 

Analysis of the continuous speech signal to obtain a phonetic transcription 
is a significant problem for any speech-understanding system. Speech sounds
undergo a complex reorganization of their acoustic properties, from their form
when uttered in isolation, to their form in a sentence context. This
reorganization is generally accompanied by a loss of information; distinctive
differences among sounds become reduced and sometimes disappear altogether.

Analytic segmentation and labeling rules may be constructed to extract the
segments of speech that are characterized by unchanging features (Mermelstein, 
1975). Due to variations in context and speaker, however, these rules are at 
best probabilistic in nature, as they only select a highly likely hypothesis 
concerning the underlying segments. The rules are based on acoustic measurements
pertaining only to a short-time interval of the signal in and around the hypoth- 
esized segment. 

To utilize information from a somewhat larger context, one attempts to
verify the analysis-derived hypotheses at the syllable or word level. Word
boundaries are not readily apparent in fluent speech; therefore one wants to 
consider the verification of syllable-sized units. By restricting our analysis 
to admissible syllables of the language, both those found within words and those
spanning word boundaries, we can immediately reject a large number of hypotheses.
Additionally, knowing the syllable context, we can utilize predictions concern- 
ing the effects of neighboring sounds on each other in order to ascertain 
whether the data in fact support those hypotheses. 

We first review some results concerning human perceptual confusions among 
speech sounds in order to select an appropriate representation on which to com- 
pute distance measures. Next, several desirable properties are cited for a dis- 
tance metric appropriate for the verification of syllable-length hypotheses. 
Distance measures previously used for limited word-recognition systems possess 
these properties to a variable extent. Distance-based recognition is generally 
inappropriate for selecting one of more than a few hundred distinct patterns.
For a fixed finite probability of error for any individual membership
comparison, the recognition probability tends to zero as the number of patterns is
increased. Therefore, we suggest that analysis be used to select only a few
reasonable hypotheses concerning the phonetic content of a syllable, and
conventional word-recognition techniques be limited to verification of such hypotheses.
In order that a metric be appropriate for verification as well as recognition, we 
require not only that the distance to the correct category be a minimum, but also 
that such minima lie below a fixed threshold, and distances to incorrect
categories lie above that threshold. Finally, we cite a simple experiment whose
results emphasize the need for weighting the short-time spectral distances
according to the significance of the local differences.
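The argument about the number of patterns can be made concrete with a small calculation. The sketch below assumes, as a simplification not stated in the text, that the individual membership comparisons err independently and with the same probability; the function name and the error rate of 0.01 are illustrative only.

```python
def recognition_probability(error_per_comparison: float, n_patterns: int) -> float:
    """Probability that every one of n_patterns membership comparisons is
    correct, assuming independent errors (an illustrative assumption)."""
    if not 0.0 <= error_per_comparison <= 1.0:
        raise ValueError("error probability must lie in [0, 1]")
    if n_patterns < 0:
        raise ValueError("number of patterns must be nonnegative")
    return (1.0 - error_per_comparison) ** n_patterns


if __name__ == "__main__":
    # A few hypotheses versus a few hundred or a few thousand patterns.
    for n in (3, 30, 300, 3000):
        print(n, recognition_probability(0.01, n))
```

Under this assumption the probability of an error-free decision falls geometrically with the number of candidate patterns, which is why restricting distance-based comparison to a handful of analysis-derived hypotheses is attractive.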

Psychological Distance Representation

Experimental data on confusion among speech sounds by human listeners are 
available from perception and recall experiments. Miller and Nicely (1955) 
measured perceptual confusions among single initial consonants under various 
conditions of noise added to the speech signal. Wickelgren (1966) measured con- 
fusion among consonants that were perceived correctly in a serial recall experi- 
ment. The confusion patterns were generally similar. Essentially the same fea- 
ture system could explain the confusions in auditory perception as in short-term 
memory. Where confusion exists, it can be viewed as the result of selective 
substitution of features such as voicing, nasality, openness, and place. Simi- 
larity among consonants was found to be a monotonic function of the number of 
features they share. Where confusion among consonant-vowel and vowel-consonant 
sequences was tested, the order was not significant for vowel errors but was a
feature of consonant errors. 

Shepard (1972) derived a similarity matrix from the Miller-Nicely confusion
data and obtained a spatial representation of the speech sounds. He assumed
that similarity is an exponentially decreasing function of interclass distance
and minimized the error between the similarity and its distance-derived
representation,

$$\sum_{i>j} \left\{ S_{ij} - \left( e^{-bD_{ij}} + c \right) \right\}^{2}$$

where $S_{ij} = (P_{ij} + P_{ji})/(P_{ii} + P_{jj})$ is a function of the reported
confusion matrix. $D_{ij}$ is the distance between classes $i$ and $j$ in the
spatial representation recovered, given by

$$D_{ij} = \left[ \sum_{k} (x_{ik} - x_{jk})^{2} \right]^{1/2}$$

where $x_{ik}$ is the projection of the coordinate of the $i$th class on the
$k$th orthogonal dimension of the underlying perceptual space. Parameters to be
determined are $b$ and $c$. Over 99 percent of the
variance for confusion among 16 consonants was accounted for on the basis of two
orthogonal dimensions. These dimensions corresponded roughly to the perceptual
features of voicing and a combination of nasality and frication.
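The quantities in this fitting procedure can be illustrated numerically. The following is a minimal sketch, not Shepard's procedure itself: it only evaluates the similarity matrix and the squared-error criterion for a given configuration, and it assumes Euclidean interclass distances. The confusion counts, the coordinates, and the values of b and c are hypothetical.

```python
import numpy as np


def similarity_from_confusions(P):
    """S_ij = (P_ij + P_ji) / (P_ii + P_jj) for a confusion matrix P."""
    P = np.asarray(P, dtype=float)
    diag = np.diag(P)
    return (P + P.T) / (diag[:, None] + diag[None, :])


def fit_error(S, coords, b, c):
    """Sum over pairs i > j of {S_ij - (exp(-b * D_ij) + c)}^2, where D_ij is
    the Euclidean distance between rows i and j of coords (an assumption)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.tril_indices(len(coords), k=-1)  # index pairs with i > j
    residual = S[i, j] - (np.exp(-b * D[i, j]) + c)
    return float((residual ** 2).sum())


# Hypothetical 3-class confusion counts (rows: spoken, columns: heard).
P = [[80, 15, 5],
     [12, 82, 6],
     [4, 8, 88]]
S = similarity_from_confusions(P)

# Hypothetical two-dimensional coordinates for the three classes.
coords = [[0.0, 0.0], [1.0, 0.0], [2.5, 1.0]]
print(fit_error(S, coords, b=1.0, c=0.0))
```

A complete fit would minimize this criterion over the coordinates as well as over b and c; the sketch stops at evaluating it.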

This spatial representation is shown in Figure 1. A hierarchical clustering
procedure which sequentially clusters sound pairs in the order of their
similarity yields the clusters indicated. These clusters roughly correspond to
those one derives on the basis of confusions at decreasing levels of
signal-to-noise ratio. There appears to be a good correlation between the similarity
values under different noise conditions; decreasing the signal-to-noise ratio
increases the confusion among similar sounds.

Figure 1 removed due to copyright restrictions. (Spatial and hierarchical
representation of the perceptual similarity between consonants. From Shepard,
1972, McGraw-Hill, Inc.)

It is significant to note that the sound space is not uniformly populated.
A distance sufficiently large to cross the boundary between /p/ and /k/ is prob- 
ably not significant for variation among different tokens of /s/. The technique 
relies on confusion data; therefore, the distance between distinct tokens of 
members of the same phonemic category is assumed to be zero. Since any contin- 
uous instrumental measure must be sensitive to both intercategory and intracate- 
gory variation, these results can only be used as a guide to the construction of 
an appropriate distance metric. 

A similar spatial distribution can be achieved for vowel sounds and is
given in Figure 2. Although the data are shown in three dimensions, which
account for 99 percent of the variance, the first two dimensions account for 97
percent. While the principal dimensions correspond roughly to the first two
formant frequencies of the vowels, the second dimension appears to be compressed
roughly logarithmically with frequency. These results correlate well with known
data concerning the spacing of critical bands in the human auditory system (a
critical band being the band within which noise effectively masks a signal of
fixed frequency). These critical bands are about equally spaced with frequency
below 1000 Hz, increasing logarithmically thereafter. The mel-frequency scale
reflects that spacing.

Figure 2 removed due to copyright restrictions. (Three-dimensional spatial
representation for 10 vowel phonemes. From Shepard, 1972, McGraw-Hill, Inc.)
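The behavior of the mel-frequency scale can be illustrated with a commonly used analytic approximation. The formula below is not taken from the text; it is one widely quoted fit, chosen here only to show the roughly linear spacing at low frequencies and the logarithmic compression above about 1000 Hz.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Common analytic approximation to the mel scale (an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Mel increments for equal 100-Hz steps shrink as frequency rises.
    for f in (100, 500, 1000, 2000, 4000):
        print(f, hz_to_mel(f + 100) - hz_to_mel(f))
```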

Confusion between vowels and consonants seems quite rare, but no data are 
available. It is unfortunate that the semivowels and glides were not included 
in the Miller-Nicely confusion experiments since these would have yielded the
most interesting consonant-vowel confusion data.

Compound consonants present additional problems. Despite the close fusion 
in articulation between the component consonants of a compound, the confusions 
of the compounds can be explained in terms of the confusion of the components 
(Pickett, 1958). This result may be due to phonological constraints among the 
compounds. Since stops and fricatives are relatively rarely confused, the
classes of compounds in which they participate will also be rarely confused. 
Confusion predominates among the stop-liquid compounds in initial and the nasal- 
stop group in final position. 

According to Wickelgren (1966) consonant similarity and vowel similarity 
can be considered as independent dimensions in syllable recall. However, co- 
articulation effects modify the acoustic cues for consonants, depending on the 
syllabic vowel. Therefore the possibility of perceptual interactions between
consonant and vowel must be recognized.

Desirable Distance Measure Properties 

In view of the above results, a distance measure that models human
performance should ideally recognize the phonemes, and construct the distance
measure from phoneme confusability data. Failing such recognition, we can at best
approximate the peripheral, precategorical aspects of human speech perception
behavior.

Let us postulate a set of desirable properties for a distance measure for 
the verification of syllable-sized segments. 

1. The measure should operate on time-aligned versions of the tokens 
to ensure consonant-to-consonant and vowel-to-vowel comparison. 
Since syllables have but one prominent vowel, the best aligned 
tokens can be viewed as those that will minimize vowel-vowel dif- 
ferences as well as differences in the prevocalic and postvocalic 
position. 

2. If the final distance measure is a time integral of some distrib- 
uted distance function, an appropriate weighting function that 
assesses the significance of the contributions from the individ- 
ual short-time segments must be used. 

3. The distance measure between tokens should be symmetric, D(X,Y) =
D(Y,X).

4. It should be possible to utilize the distance measure to determine
phonetic equivalence. If X and Y are phonetically equivalent,
but X and Z are not, then D(X,Y) < D(X,Z).

5. Let A,B be parametric representations of two tokens; then
M(A,B) = M(B,A) = (A+B)/2 is a template for the class (A,B) such
that D(A,M) ≤ D(A,B) and D(B,M) ≤ D(A,B).

Templates are used as compact descriptors for equivalence classes. Consider
the class of metrics defined as the weighted sum of elemental metric
components for short-time segments. Let P be some space of time-warping
transformations such as shown in Figure 3:

$$D(X,Y) = \min_{p(\tau) \in P} \sum_{\tau} w(\tau)\, d[x(\tau), y(\tau)]$$

where $d(\tau) = d[x(\tau), y(\tau)]$ is an elemental metric component over a short-time
segment of the path $p(\tau)$ that maps $1 \le t_x \le T_x$ and $1 \le t_y \le T_y$ onto $\tau$, and $w(\tau)$
is some positive semidefinite weighting function that assesses the significance
of the contribution from each element of the path.
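A distance of this form, minimized over time-alignment paths, can be computed by dynamic programming. The sketch below is one simple instance under stated assumptions: the elemental metric is the Euclidean distance between frames, the admissible path steps are diagonal, horizontal, and vertical, and the weighting function is supplied by the caller (unit weights by default). None of these particular choices is prescribed by the text.

```python
import numpy as np


def _as_frames(a):
    """Return a as a (T, n_params) array; 1-D input is one parameter per frame."""
    a = np.asarray(a, dtype=float)
    return a[:, None] if a.ndim == 1 else a


def aligned_distance(x, y, weight=None):
    """Minimum over monotonic alignment paths of the weighted sum of
    elemental distances between frames of x and frames of y.

    weight: optional function (i, j) -> nonnegative weight for pairing
    frame i of x with frame j of y; defaults to 1 for every pairing.
    """
    x, y = _as_frames(x), _as_frames(y)
    Tx, Ty = len(x), len(y)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            d = float(np.linalg.norm(x[i - 1] - y[j - 1]))  # elemental metric
            w = 1.0 if weight is None else float(weight(i - 1, j - 1))
            best_previous = min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
            cost[i, j] = w * d + best_previous
    return float(cost[Tx, Ty])


if __name__ == "__main__":
    slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]   # a time-stretched version of `fast`
    fast = [0.0, 1.0, 2.0]
    print(aligned_distance(fast, slow))
    print(aligned_distance(fast, [0.0, 1.0, 3.0]))
```

With unit weights and this symmetric step pattern the measure is symmetric in its two arguments, and a token compared with a time-stretched copy of itself yields zero distance.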

Among requirements that we may want to impose on the elemental distance
metric between any two short-time segments are

1. positive semidefinite, $d(x,y) \ge 0$ (this ensures that the global
metric is also positive semidefinite),

2. symmetric, $d(x,y) = d(y,x)$,

3. that it satisfies the triangle inequality
$d(x,y) + d(y,z) \ge d(x,z)$,

4. that it satisfies a perceptual weighting of the frequency components
of the power spectra of the signals. If variation in
$s(\omega_1)$, the energy at frequency $\omega_1$, is perceptually more
significant than that in $s(\omega_2)$, then
$d[x, x + \Delta s(\omega_1)] > d[x, x + \Delta s(\omega_2)]$.

Figure 3: Typical path in the space of time-alignment transformations between
two speech segments.

Figure 4: Spectrograms for the reference words "immunity" (top) and "community"
(bottom), and unknown (right). Frequency in kHz units, time in
increments of 12.8 msec.

The need for careful assessment of the significance of spectral variations 
was realized when we carried out the following experiment (Nye, Cooper, and
Mermelstein, 1975). Human spectrographic pattern recognizers were asked to
match the words of an unknown sentence, presented in spectrographic form, with
the same words from a reference library of spectrographic patterns. The
reference library was generated by the same speaker, stored in computer-retrievable
form, and displayed through specification of a list of required features. Since
the phonetic transcription of the reference words was not made available, the
subjects were discouraged from using syntax and semantics to assist the
pattern-matching operation. While the subjects had no problem in rejecting the
phonetically dissimilar words, they encountered frequent confusions between similar
words. Figure 4 shows the two reference words "community" and "immunity" at
left, and the unknown word at the right. In the presence of some uncertainty
concerning the word boundary, the disagreement in the unstressed syllable at the
top just to the right of the first arrow was accepted by two observers in view
of the wide agreement over the rest of the word. The region of significant
spectral disagreement between the two extends for no more than 100 msec. Clearly
we need a rather sophisticated metric to resolve such distinctions.

Acoustics Based Distance Measures 

Let us now examine some distance measures proposed previously in the light 
of these requirements. Sakoe and Chiba (1971) constructed a Euclidean distance
metric on short-time spectral samples obtained from a bank of band-pass filters.
When the words were aligned in time through use of a dynamic programming
algorithm to minimize the total word-to-word distance, they achieved 99 percent
recognition of the 100 two-digit Japanese numbers of five speakers. Klatt (1976)
has proposed weighting the spatial distance metric with a function that reflects 
the increased perceptual importance of differences near the spectral peaks, and 
reduced perceptual importance of the differences near spectral minima. Itakura 
(1975) suggested use of the minimum prediction residual as a distance measure 
for isolated word recognition. This measure computes the ability of the linear 
predictor that is optimum for the reference-word segment to predict the signal 
waveform of the target-word segment, 

$$d(X/a) = \log \frac{a V a'}{\hat{a} V \hat{a}'}$$

That is, the distance between the target segment characterized by process X and
the reference segment, having the optimum linear-prediction vector $a$, is given by
the log-likelihood ratio where $\hat{a}$ is the optimum linear predictor of X, and V is
the matrix of autocorrelation coefficients of X. While this measure can be
computed rather quickly from the signal waveform, it is not symmetric between
reference and target. To overcome this, Gray and Markel (1975) have suggested a
symmetric modification of the linear-prediction residual, namely

$$d_s(X/a) = d(X/a) + d(a/X).$$

The linear-prediction residual is a measure of the unpredicted signal energy. 
There is no attempt to assess the significance of the suboptimum prediction of 
the signal waveform. For some signals even a rough spectrum approximation 
appears adequate, for others a finer representation is required. 
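The prediction-residual measure and its symmetric modification can be sketched numerically as follows. This is an illustration under stated assumptions rather than the published implementations: the autocorrelation method is used, the predictor order of 4 is arbitrary, the normal equations are solved directly instead of by a recursive algorithm, and the two test signals are synthetic.

```python
import numpy as np


def autocorrelation(frame, order):
    """Autocorrelation lags r[0..order] of one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])


def toeplitz(r):
    """Symmetric Toeplitz matrix whose (i, j) entry is r[|i - j|]."""
    idx = np.arange(len(r))
    return r[np.abs(np.subtract.outer(idx, idx))]


def inverse_filter(r):
    """Coefficients [1, a_1, ..., a_p] of the optimum prediction-error filter."""
    p = len(r) - 1
    coeffs = np.linalg.solve(toeplitz(r[:p]), -r[1:])
    return np.concatenate(([1.0], coeffs))


def prediction_residual_distance(target, reference, order=4):
    """Log ratio of the target's residual energy under the reference predictor
    to its residual energy under its own optimum predictor."""
    r_target = autocorrelation(target, order)
    V = toeplitz(r_target)                      # autocorrelation matrix of target
    a_ref = inverse_filter(autocorrelation(reference, order))
    a_opt = inverse_filter(r_target)
    return float(np.log((a_ref @ V @ a_ref) / (a_opt @ V @ a_opt)))


def symmetric_distance(x, y, order=4):
    """Sum of the two one-sided distances."""
    return (prediction_residual_distance(x, y, order)
            + prediction_residual_distance(y, x, order))


rng = np.random.default_rng(0)
noise = rng.standard_normal(400)                            # synthetic segment
smooth = np.convolve(noise, [1.0, 0.8, 0.5], mode="same")   # spectrally shaped copy
print(prediction_residual_distance(noise, smooth))
print(prediction_residual_distance(smooth, noise))
print(symmetric_distance(noise, smooth))
```

Because the optimum predictor minimizes the residual energy, each one-sided distance is nonnegative and is zero when a segment is compared with itself; the two one-sided values generally differ, which is the asymmetry noted above.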

White and Neely (1975) performed a comparative evaluation of the Euclidean 
spectral distance measure and the one based on the linear-prediction residual.
They found them roughly equivalent in terms of performance for recognition of a
36-word and a 91-word vocabulary of one speaker. They concluded that the major
improvement over previous results arose from the use of the various dynamic
programming algorithms for word alignment. Use of the dynamic programming
technique for word recognition was first proposed by Velichko and Zagoruyko (1970).

Atal (1974) has used a non-Euclidean distance measure for speaker recogni- 
tion, namely 

$$d(M_1, M_2) = (M_1 - M_2)\, W^{-1} (M_1 - M_2)'$$

where the $M_i$ are parameter vectors to be selected and W is the covariance matrix
of $M$. He explored representations in terms of linear-prediction coefficients,
impulse response coefficients, autocorrelation function samples,
predictor-derived area functions, and cepstral parameters. The cepstral coefficients $c_k$ are
related to the linear-prediction parameters by

$$\sum_{k=-\infty}^{\infty} c_k e^{-jk\omega} = \ln\left[ \sigma / |A(e^{j\omega})| \right]$$

where $1/A(e^{j\omega})$ is the linearly predicted signal spectrum and $\sigma$ is the rms
energy. Among the different parametric representations, the cepstral coefficients
gave the highest speaker identification accuracy. Representation in terms of 
cepstral coefficients has the advantage that a set of coefficients of the same 
order can be averaged, and the result equals the cepstral representation of the 
average of the log power spectra (after normalization to unity gain) . Use of 
the covariance matrix normalizes the contributions of the components of the 
parameter vectors independently of any linear transformations they may undergo.
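The normalizing effect of the covariance matrix can be checked numerically. In the sketch below the parameter vectors are random numbers standing in for real measurements, and the matrix T is an arbitrary invertible transformation; both are hypothetical and serve only to show that the covariance-weighted distance is unchanged when all vectors undergo the same linear transformation.

```python
import numpy as np


def covariance_weighted_distance(m1, m2, W):
    """(m1 - m2) W^{-1} (m1 - m2)' for parameter vectors m1 and m2."""
    diff = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    return float(diff @ np.linalg.solve(W, diff))


rng = np.random.default_rng(1)
vectors = rng.standard_normal((200, 3))        # hypothetical parameter vectors
W = np.cov(vectors, rowvar=False)              # their covariance matrix

T = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 0.5]])                # arbitrary invertible linear map
transformed = vectors @ T.T
W_transformed = np.cov(transformed, rowvar=False)

d_original = covariance_weighted_distance(vectors[0], vectors[1], W)
d_transformed = covariance_weighted_distance(transformed[0], transformed[1],
                                             W_transformed)
print(d_original, d_transformed)  # the two values agree up to rounding error
```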

Bridle and Brown (1974) used a set of 19 weighted spectrum-shape coeffi- 
cients given by the cosine transform of the outputs of a set of nonuniformly 
spaced bandpass filters. The filter spacing is chosen to be logarithmic above 
1 kHz and the filter bandwidths are increased there as well. We will, therefore, 
call these the mel-based cepstral parameters. Pols (1971) showed good word
recognition results using only the three shape variation components maximally
contributing to total spectral shape variation. These components resemble the
mel-based cepstral parameters rather closely in terms of their frequency variation.
The mel-based cepstral parameters have the advantage that generally fewer
parameters suffice for an adequate representation of the power spectrum than the
linear-prediction coefficient series. A truncated cepstral representation
corresponds to a frequency-smoothed power spectrum, one from which evidence
concerning the individual harmonics of the speech signal is missing. To the extent
that the spectrum of the excitation signal is invariant between successive
voiced segments of the speech signal, the mel-based cepstral measure corresponds
to a mel-weighted summation of the difference between the two smoothed vocal
tract transfer functions.
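The cosine transform of a set of filter outputs can be sketched as follows. The particular cosine basis used here (with a half-channel offset) and the use of log energies as input are assumptions made for illustration; the text does not specify them. The 19-channel input is hypothetical.

```python
import numpy as np


def cepstrum_from_filterbank(log_energies, n_coeffs):
    """Cosine transform of log filter-bank outputs E_1..E_K:
    c_i = sum_k E_k * cos(i * (k - 0.5) * pi / K), for i = 1..n_coeffs.
    The half-channel offset in the basis is an illustrative choice."""
    E = np.asarray(log_energies, dtype=float)
    K = len(E)
    k = np.arange(1, K + 1)
    i = np.arange(1, n_coeffs + 1)[:, None]
    basis = np.cos(i * (k - 0.5) * np.pi / K)
    return basis @ E


if __name__ == "__main__":
    flat = np.ones(19)                       # flat spectrum across 19 channels
    tilted = -np.arange(1, 20, dtype=float)  # energy falling with channel number
    print(cepstrum_from_filterbank(flat, 2))
    print(cepstrum_from_filterbank(tilted, 2))
```

Low-order coefficients of this kind describe broad spectral shape (overall tilt, then coarse curvature), so truncating the series amounts to smoothing the spectrum along the nonuniform frequency axis.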

Experiments With a Mel-Based Cepstral Distance Measure

I have been concerned with the adequacy of a mel-based cepstral distance
measure to discriminate phonetically similar words and syllables. To evaluate
the contribution of time-dependent significance functions to an integrated
distance measure, I conducted the following experiment: four speakers, two male,
two female, recorded one production of each of the twelve phonetically similar
words, "stick," "sick," "skit," "spit," "sit," "slit," "strip," "scrip," "skip,"
"skid," "spick," and "slid" in a reference context "say ___ again." The words
were excised from the carrier by listening to a specifiable delimited segment of
the signal. Spectra were computed for all the words and reduced to a
two-dimensional cepstral representation. The respective interword distances were
determined for all possible pairs of words by time alignment with Itakura's dynamic
algorithm. The unweighted metric used was

$$d(a,b) = \frac{1}{N} \sum_{\tau=1}^{N} \sum_{k=1,2} \left[ C_k^a(\tau) - C_k^b(\tau) \right]^2$$

where $C_k^x(\tau)$, $\tau = 1, \ldots, N$; $x = a,b$; $k = 1,2$ are the time-aligned,
two-dimensional, mel-based cepstral coefficient vectors for the two words. Figure 5 shows histograms
of the interword distances for the same word spoken by two different speakers,
as well as for all other pairs comparing different words spoken by the same or
different speakers. The complete overlap between the two comparison categories
is surprising. Although the unweighted distance measure is useful to
differentiate phonetically distant words, it is clearly not applicable to the
discrimination of phonetically similar words.
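For two words that have already been time-aligned, the unweighted metric reduces to a mean squared difference over frames. The sketch below assumes normalization by the number of aligned frames and takes the alignment as given; the coefficient values are hypothetical.

```python
import numpy as np


def unweighted_cepstral_distance(c_a, c_b):
    """Mean over aligned frames of the squared difference between two
    sequences of cepstral vectors, each of shape (N, n_coeffs)."""
    c_a = np.asarray(c_a, dtype=float)
    c_b = np.asarray(c_b, dtype=float)
    if c_a.shape != c_b.shape:
        raise ValueError("sequences must be time-aligned to the same shape")
    return float(((c_a - c_b) ** 2).sum() / c_a.shape[0])


if __name__ == "__main__":
    # Two hypothetical aligned words, two frames of two coefficients each.
    word_a = [[0.0, 0.0], [1.0, 1.0]]
    word_b = [[0.0, 1.0], [1.0, 3.0]]
    print(unweighted_cepstral_distance(word_a, word_b))
```

Every frame and every coefficient contributes equally here, regardless of how much that coefficient normally varies at that point in the word.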

I next generated templates for each of the words by time warping the words
of each speaker onto the one with longest duration using the same dynamic
programming algorithm. The mean and variance of the first two cepstral parameters
were next computed for the time-aligned versions and used as templates
representative of the respective words. Next the weighted distance between each token x
and template A was determined using the inverse of the variance for weighting
each cepstral coefficient difference, for example,

$$d(x,A) = \frac{1}{N} \sum_{\tau=1}^{N} \sum_{k=1,2} \frac{\left[ C_k^x(\tau) - C_k^A(\tau) \right]^2}{\left[ \sigma_k^A(\tau) \right]^2}$$

The time-alignment path, $\tau = 1, \ldots, N$, is now a function of the local cepstral
variance, $[\sigma_k^A(\tau)]^2$.
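Template construction and the variance-weighted distance can be sketched as follows. The sketch assumes the tokens have already been time-aligned to a common length, omits the alignment step itself, normalizes by the number of frames, and replaces zero variances by a small nominal floor; the floor value and the token values are hypothetical.

```python
import numpy as np


def build_template(aligned_tokens, floor=1e-3):
    """Per-frame mean and variance of time-aligned tokens.

    aligned_tokens: array of shape (n_tokens, N, n_coeffs).
    Variances below `floor` are raised to it (a nominal value, assumed here).
    """
    tokens = np.asarray(aligned_tokens, dtype=float)
    mean = tokens.mean(axis=0)
    variance = np.maximum(tokens.var(axis=0), floor)
    return mean, variance


def weighted_distance(token, mean, variance):
    """Mean over frames of squared differences from the template mean,
    each weighted by the inverse of the local template variance."""
    token = np.asarray(token, dtype=float)
    return float((((token - mean) ** 2) / variance).sum() / len(mean))


if __name__ == "__main__":
    # Two hypothetical aligned tokens of one word: 2 frames, 2 coefficients.
    tokens = [[[0.0, 0.0], [2.0, 2.0]],
              [[2.0, 0.0], [2.0, 4.0]]]
    mean, variance = build_template(tokens)
    print(weighted_distance(tokens[0], mean, variance))
    print(weighted_distance(mean, mean, variance))
```

A difference counts heavily where the tokens used to build the template agreed closely and lightly where they varied widely; in a full procedure the same local weighting would also enter the choice of alignment path.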

A fixed distance threshold allowed the correct assignment of all but 2 of
the 48 tokens to the appropriate word class. The two confusions arose through
incorrect assignment of one token of "slit" to "sit" and one token of "spit" to
"spick." The same tokens were used to generate the template and to test them;
therefore, this represents a biased test of discriminability. When I attempted
to generate templates from fewer tokens, editing problems near the word edges,
such as whether the release of the final stop was properly included, resulted
in significantly poorer discrimination. Nevertheless, the dramatic difference,
as compared to the use of unnormalized distances, underlines the necessity of
including appropriate modeling of the significance of the encountered variation
of the parameters.

Figure 5: Histograms of computed interword distance values for (a) words from
the same category (different speakers), (b) words from different
categories (same or different speakers).

One result of using the inverse of the parameter variance for the weighting
function is to assign more significance to silent segments where the variance
was actually zero (assigned a finite nominal value), than to the segments having
finite energy. Since all our tokens began with the phoneme /s/, we could not
explore the question of the relative weights to be assigned to fricatives and
voiced sounds. Presumably, the relative cepstral distance among the class of
unvoiced fricatives is larger than that among the vowels. Therefore one would want
to tolerate larger differences in fricative regions than in vowellike regions
before rejecting a given hypothesis.

A further desirable property of a time dependent weighting function appears 
to be the assignment of larger weights to regions of high spectral variation than 
to stationary regions. Otherwise, for steady-state segments the contributions to
overall distance are proportional to the durations of the segments. Under those
conditions vowel differences would be overemphasized. No experimental results
are as yet available on this point.

Discussion and Conclusions 

Synthesis represents an alternative technique for generating the reference 
templates. Klatt (1975) and Cook (1976) have proposed a word verification
procedure based on synthesis of the hypothesized word. Its use offers large
potential savings in storage requirements at the cost of a small increment in
processing requirements.

The prime motivation of using templates derived from actual productions at 
this point is the need to establish quantitatively the amount of speaker and 
context dependent variation for which verification techniques must provide. 
While synthesis procedures generally give us a perceptually acceptable
representative of the class to which the token may be assigned, they provide no
information concerning the admissible variation in the individual parameters. As
we gain more insight into the relative significance of short-time variations in 
speech spectra and achieve an ability to model the process adequately, synthesis 
will undoubtedly become a more cost-effective procedure for the generation of 
templates. Until that time, however, one must resort to the generation of tem- 
plates from actual productions in the exploration of hypothesis verification 
techniques. 

Our attempt to utilize insights from speech perception processes as an aid 
to improved speech verification techniques suffers from an inability to separate 
the peripheral and central processes in human speech perception. There remains 
a large gap in our knowledge concerning the transformations that the signal un- 
dergoes before the segmental information is extracted. We do not yet have an 
adequate model of the extent of acceptable variation among tokens that belong to 
a segmental equivalence class. Nevertheless, known properties of perception may 
be used to guide us toward perceptually relevant representations of the speech 
signal. We have some evidence that improved verification results are obtainable 
by focusing on those representations of the speech signal which have proven to 
be of interest for human speech perception.

Laryngeal Timing in Consonant Distinctions*
Arthur S. Abramson

ABSTRACT

The concept of voice onset time (VOT) is reviewed with attention 
to recent misunderstandings. Although it was procedurally convenient 
and linguistically interesting to focus for some time on word-initial 
stop consonants, VOT is properly viewed as a particular manifestation 
of a more general phenomenon, laryngeal timing. 

The timing of the valvular action of the larynx may be said to be a
physiological mechanism that underlies such acoustic phonetic features as the onset
and offset of voice pulsing, intensity of plosive release, amount of aspiration
noise, attenuation of the first formant, onset of voice-excited formant
transitions, and perturbations of fundamental frequency. These features intersect
in various combinations to furnish the phonetic basis of phonologically relevant
voicing and aspiration.¹ These features also seem to cover most instances of
the vaguely defined term "tense" or "fortis," as applied to consonants.²

In our early approach to these matters (Lisker and Abramson, 1964, 1965; 
Abramson and Lisker, 1965),^3 Leigh Lisker and I focused our attention on stop-
consonant distinctions in word-initial position. For our cross-language in-
vestigations, this choice made sense, because the richest sets of contrasts are
most often found in initial stops. We hypothesized that temporal variations in 



*Under the editorship of Celia Scully and Gunnar Fant, this is to be published
as one of a group of papers based on the seminar on "The Larynx and Language" 
held at the Eighth International Congress of Phonetic Sciences, Leeds, England, 
17-23 August 1975. 

Also of University of Connecticut, Storrs. 



^1 The phonemic use of voiced aspiration is not fully handled by laryngeal timing
alone; it also requires a dimension of glottal aperture.

^2 For example, in English and Spanish.

^3 In this short review, I shall cite mainly work done in collaboration with a few
of my colleagues. Certain references needed to document controversial matters
and theoretical points will also be given.

[HASKINS LABORATORIES: Status Report on Speech Research SR-47 (1976)]
glottal settings for phonation would differentiate most homorganic consonants
said to be distinguished phonologically by such features as voicing, aspiration, 
and tensity. Since, in those days — and to a great extent to this day—it was 
difficult to make extensive physiological observations of the action of the 
larynx, we used instrumental displays of the acoustic signal for analysis. The
most convenient acoustic index to the closing of the glottis for phonation in 
initial position was the beginning of regular vertical striations corresponding 
in a wide-band spectrogram to the quasi-periodic voice pulses of speech. We 
proposed the term voice onset time (VOT) which we defined as the temporal rela-
tion between the onset of glottal pulsing and the release of the initial stop
consonant. Specifically, voicing detected before the release, that is, during 
the stop occlusion, was called voicing lead, while voicing starting after the 
release was called voicing lag. 

By and large, we found that VOT is indeed a very good index to laryngeal
timing for the types of homorganic stop consonants in question. The measure
provided rather good separation for labial, dental, alveolar, retroflex, and
velar stops across a variety of languages that have two or three distinct classes
at each place of articulation (Lisker and Abramson, 1964, 1967). Adopting the
convention of assigning a timing value of zero to the moment of stop release,
negative values to voicing lead, and positive values to voicing lag, we found
an essentially trimodal distribution of VOT values for eleven languages that
were examined. The first mode centers at -100 msec for a range of values rep-
resenting voiced unaspirated stops. The second mode centers at +10 msec and
corresponds most generally to voiceless unaspirated stops. The third mode cen-
ters at +75 msec and corresponds to voiceless aspirated stops. Voicing lag,
seemingly occurring for the most part with an open glottis, was regularly ac-
companied by turbulent excitation of the upper vocal tract (aspiration); in
addition, attenuation of the first formant was often visible in the spectrogram
[for example, F1 cutback (Liberman, Delattre, and Cooper, 1958)].
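
The sign convention and the three modes just described can be put in the form of a short computational sketch. The Python below is only an illustration under stated assumptions: the function names and the nearest-mode rule are hypothetical conveniences, not the procedure used in the studies cited, and the mode centers are simply the values quoted above. A single number of this kind stands in for a whole complex of acoustic consequences and should not be read as an acoustically simple measure.

```python
# Mode centers in msec, as quoted in the text for the eleven-language survey.
# The category labels follow the text's "most general" correspondences.
VOT_MODES_MS = {
    "voiced unaspirated": -100.0,
    "voiceless unaspirated": 10.0,
    "voiceless aspirated": 75.0,
}


def vot_ms(release_ms, voicing_onset_ms):
    """Return VOT with the stop release taken as time zero.

    Negative values mean voicing lead, positive values mean voicing lag.
    """
    return voicing_onset_ms - release_ms


def lead_or_lag(vot):
    """Name the timing relation for a VOT value in msec."""
    if vot < 0:
        return "voicing lead"
    if vot > 0:
        return "voicing lag"
    return "voicing simultaneous with release"


def nearest_mode(vot):
    """Assign a VOT value to the closest mode center.

    ASSUMPTION: nearest-center assignment is an illustrative rule only;
    actual category boundaries differ from language to language.
    """
    return min(VOT_MODES_MS, key=lambda name: abs(VOT_MODES_MS[name] - vot))


if __name__ == "__main__":
    # Hypothetical token: release at 250 msec, glottal pulsing from 165 msec.
    value = vot_ms(release_ms=250.0, voicing_onset_ms=165.0)
    print(value, lead_or_lag(value), nearest_mode(value))
```

In this sketch a token whose pulsing begins before the release receives a negative value and falls nearest the first mode; tokens with short or long lag fall nearest the second or third.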

It is clear then that VOT is not defined as an acoustic continuum, although
it may be viewed as an articulatory or physiological continuum. In using tech- 
niques of speech synthesis to validate our findings, we varied values of voicing 
lead and voicing lag, with the latter including increments of cutback of the 
first formant and noise excitation of the upper formants. With stimuli simu- 
lating labial, apical, and dorsal CV syllables, we demonstrated the perceptual 
efficacy of the VOT dimension across a few languages (Abramson and Lisker, 1965,
1970a, 1973; Lisker and Abramson, 1970). 

Since this work has stimulated many studies on the part of others, grati- 
fyingly too numerous to list here, it is important to stress that psychological 
and linguistic discussions of VOT should not give the impression that it is an
acoustically simple dimension. It is radically different from many other con- 
tinua in the literature in that there is an abrupt qualitative discontinuity 
at the point of stop release. Discussions of special mechanisms for the pro- 
cessing of speech, feature detectors, and other related matters must make it 



^4 For those with a fourth laryngeal class, see footnote 1. Ejectives, not con-
sidered here, may also be said to involve laryngeal timing; however, it is the
timing of the tight closing of the vocal folds relative to oral closure that
is relevant.
clear that we are dealing with profound psychoacoustic shifts. Voicing lead
presents the ear with a low-amplitude, low-frequency spectrum during the initial
part of the stimulus. In the absence of lead, we have the sudden full unfold-
ing of the formant pattern for the syllable. For appreciable values of voicing
lag, the noise excitation of the formant pattern with its sudden shift to a
train of voicing pulses has been shown by our data to be psychoacoustically
easier to process.

I fear that our coining of the term voice onset time with its popular ac-
ronym VOT, handy as it was for much of our research, has led some colleagues
astray. A more appropriate concept is simply that of voice timing — that is,
laryngeal timing — which subsumes VOT as a special case. Some scholars, finding
VOT very useful for their purely perceptual speculations, have perhaps found lit-
tle interest in our more physiological endeavors, which, I think, put our acous-
tic and perceptual data into proper perspective. Transillumination of the lar-
ynx (Lisker, Abramson, Cooper, and Schvey, 1969), fiberoptic observations
(Lisker, Sawashima, Abramson, and Cooper, 1970; Sawashima, Abramson, Cooper, 
and Lisker, 1970; Cooper, Sawashima, Abramson, and Lisker, 1971), and electro- 
myographic recordings combined with fiberoptic observations (Hirose, Lisker, 
and Abramson, 1972) all show that in running speech the dimension of laryngeal 
timing is a powerful differentiator of homorganic consonants. 

I cannot refrain from alluding to two serious misunderstandings of our con-
cept of VOT. In a purported demonstration of the unimportance of VOT for English
initial stops, Winitz, LaRiviere, and Herriman (1975) manipulate the onset of
voice timing, that is, the beginning of simulated glottal pulsing, as a complete-
ly independent variable. Thus, VOT values were altered in real speech record-
ings in such a way as to yield improbable and even impossible temporal combina-
tions and sequences of voice pulsing and aspiration. Using the resulting "syl-
lables" as stimuli in perception tests, they claimed to show that aspiration
is the major cue to voicing distinctions, while VOT is a secondary cue. Clearly
these investigators have not grasped the central point that VOT is a physiolog-
ical dimension which generates a complex set of intersecting, overlapping or
even discrete acoustic cues. To take, for example, an original English /du/ and
move the consonant burst back so that there is a silent gap of 35 msec between
it and the onset of voicing (Winitz, LaRiviere, and Herriman, 1975:Figure 1)
and say that this is the equivalent of a VOT value of plus 35 msec in conformity
with the conventional model (Lisker and Abramson, 1964; 1971), is simply unten-
able. An honest use of our concept and test thereof would reveal that such a
value of VOT would include turbulent excitation of the upper formants and atten-
uation of the first formant. These authors (Winitz et al., 1975) have the per-
fect right to tease out any of the acoustic cues associated here with laryngeal
timing, and perhaps others not yet mentioned, and to test the perceptual effi-
cacy of any one of them, as has been done, for example, for the completion of



^5 After all, even chinchillas have been trained to perceive VOT differences
(Kuhl and Miller, 1975).
formant transitions before or after the onset of voicing by Stevens and Klatt
(1974)^6 and the role of fundamental frequency by Haggard, Ambler, and Callow
(1970) and Fujimura (1971). Although I readily concede that our terminology
needs elaboration to cover the separate acoustic aspects of laryngeal timing,^7
it hardly behooves other investigators to cite us in denigrating VOT without 
reading closely to see that we mean much more than the mere timing of voice 
pulsing as a feature orthogonal to other consequences of laryngeal timing. 

The other recent instance of misunderstanding I have in mind is a study 
of voicing and aspiration in Hindi final stop consonants by Bhatia (1976). The 
author somehow interprets the work on VOT by Lisker and me (1964) and on the 
related matter of the size of glottal opening by Kim (1970), to predict the
neutralization of aspirated and unaspirated stops in final position. To the
extent that certain statements by Kim may be vulnerable to Bhatia's criticism,
I have no wish to enter into the argument; nevertheless, degrees of glottal 
opening seem clearly relevant to the final distinctions in Hindi. Except for 
the special states of the glottis required for such features as murmur and 
creak, we would argue that the degrees of glottal opening needed for voicing 
distinctions including voiceless aspiration go with laryngeal timing. I must 
protest that here too an investigator (Bhatia, 1976) has failed to grasp the 
point that VOT is an utterance-initial manifestation of the more general phenom-
enon of laryngeal timing. Indeed, one could go further and argue reasonably
that word-final aspiration is an instance of voice onset time. Consider that 
in an English word like potato the unstressed first syllable is likely to have 
no voicing at all; that is to say, it is completely aspirated so that VOT prop- 
er does not take place until well after the beginning of the second syllable. 
The result is a voiceless vowel in the first syllable. This is a case, if you 
will, of a voicing lag so extreme as to deprive a whole syllable of voiced ex- 
citation. To produce aspiration in final position, it is necessary to release 
the stop, thus articulating an unstressed additional syllable (or perhaps
"pseudosyllable"). This additional unstressed "syllable" includes a noise-
excited vowel appropriate to the vocal tract configuration of the moment.
Bhatia's remarks on the predictive powers of phonetic theories (1976:73) are quite
gratuitous! 

It must not be supposed, one early critic notwithstanding (Kim, 1965), that 
we have ever claimed that even in utterance-initial position the dimension of 
laryngeal timing will explain every distinction of homorganic consonants that 
apparently involves laryngeal features of one sort or another (Lisker and 
Abramson, 1964, 1971, 1972; Abramson and Lisker, 1970b). VOT may be said to 
distinguish the voiced aspirated (murmured) stops of such languages as Hindi 
and Marathi from voiceless stops but certainly not from the voiced unaspirated
stops. Here VOT intersects with the kind of glottal opening that permits weak
but audible phonation to occur with simultaneous turbulence (Hirose et al.,
1972). For the three stop categories of Korean, VOT gives mixed results (Lisker
and Abramson, 1964). In word-initial position, two of the categories show a 
fair amount of overlap although the two of them are well separated from the 



^6 Lisker (1975) clarifies the matter in experiments in which he pits a literal
interpretation of VOT against "voiced transition duration."

^7 Celia Scully: personal communication.
third. These data taken with the rather complicated response patterns of per-
ceptual experiments with VOT (Abramson and Lisker, 1972) led us to conclude
that the timing of glottal adjustments relative to supraglottal articulation 
does contribute to the Korean distinctions, but that there must be another 
dimension that works with VOT in distinguishing the stop categories. The lat- 
ter conclusion has been borne out by fiberoptic and electromyographic studies 
(Kagaya, 1974; Hirose, Lee, and Ushijima, 1974). 

Shifts in three extralaryngeal features are commonly adduced in descrip-
tions of the voicing distinction: the volume of the supraglottal tract, stop
closure duration in medial position, and vowel duration before a final stop. 
For phonation to be sustained during an occlusion of the supraglottal vocal 
tract, it is necessary to prevent equalization of transglottal air pressure. 
Rothenberg (1968:91) calculates that without any special adjustment this equal- 
ization would occur in four msec, which would allow only one or two glottal 
oscillations. With passive expansion of the pharyngeal walls, voiced closures 
could be accommodated up to 20-30 msec (pp. 93-94). Active expansion of the 
pharynx, according to Rothenberg's calculations (pp. 94-99), might give voiced
closure durations of 80-90 msec. The even longer^8 voiced closure durations of-
ten observed (Lisker and Abramson, 1964) might be explained by incomplete velo-
pharyngeal closure (Rothenberg, 1968:99-106).
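
The figures quoted from Rothenberg in this paragraph amount to a small table of upper limits on voiced closure duration, which may be restated as a lookup. The Python sketch below is illustrative only: the function name is hypothetical, and taking the upper end of each quoted range as a hard cutoff is an assumption made for the sake of the example, not a claim found in the source.

```python
# Approximate upper limits (msec) on voiced closure duration, taken from the
# figures quoted in the text. ASSUMPTION: the top of each quoted range is
# used as a sharp cutoff, which the source does not assert.
VOICING_LIMITS_MS = [
    ("no special adjustment", 4),
    ("passive pharyngeal expansion", 30),
    ("active pharyngeal expansion", 90),
]

BEYOND = "beyond pharyngeal expansion (possibly incomplete velopharyngeal closure)"


def minimal_mechanism(closure_ms):
    """Return the least elaborate mechanism whose limit covers the closure."""
    for name, limit_ms in VOICING_LIMITS_MS:
        if closure_ms <= limit_ms:
            return name
    return BEYOND


if __name__ == "__main__":
    # Hypothetical closure durations in msec.
    for duration in (3, 25, 85, 120):
        print(duration, "msec:", minimal_mechanism(duration))
```

Read in this way, the table makes plain why voiced closures well over 90 msec call for some further explanation.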

Expansion of the pharynx during voiced occlusions has been observed by a 
number of investigators, at least for citation forms. Apparently because of 
a conviction that English voiced stops are "lax" and voiceless stops, "tense," 
some of them, for example Perkell (1969), assumed that the pharyngeal walls 
expanded passively to help maintain the transglottal air flow for voicing, 
while the walls were tensed to prevent voicing for the voiceless stops. Elec- 
tromyographic examination of the relevant musculature (Bell-Berti, 1975; Bell- 
Berti and Hirose, 1975) reveals that one cannot predict for a given subject 
whether active or passive control, or some combination of the two, will be 
exercised for variations in the volume of the supraglottal cavity for voicing 
distinctions in English. The feature of pharyngeal expansion is linked with 
laryngeal timing, yet it may be independent. This is not known. For that mat- 
ter, we do not know how reliable the feature of pharyngeal expansion itself is 
in running speech. 

For some time (Lisker, 1957), it has been known that spectrograms of English 
medial voiceless stops before unstressed syllables show longer closure durations 
than do voiced stops, and that manipulation of this feature, providing that no 
voiced pulsing is present during the closure, furnishes a sufficient cue for 
the perception of the voicing distinction. Whether this feature is indepen- 
dent or somehow has a dependency relationship with laryngeal timing is not known 
at this time. Comparison of closure durations across all principal environ- 
ments, using oral air pressure traces (Lisker, 1972), shows that this feature 
is likely to be present only in medial poststressed position and thus much less
useful as an index to the voicing distinction than is laryngeal timing. 

The final nonlaryngeal feature to be considered here is the well documented 
observation that in English and some other languages, vowels preceding final 



^8 See, for example, data for Sindhi (Nihalani, 1975).
voiced consonants are longer than those preceding final voiceless consonants.
This durational difference is perceptually relevant (Denes, 1955; Raphael, 1972). 
One attempt (Halle and Stevens, 1967) has been made to tie this feature directly 
to the laryngeal control needed to maintain voicing during consonant closure. 
Since, however, voicing distinctions in final position are likely to be charac-
terized by differences in laryngeal timing, namely voice offset time, the ques-
tion arises as to whether the concomitant difference in vowel duration is com-
pletely independent of laryngeal timing.

In order to distinguish classes of consonants, many languages make exten- 
sive use of the timing of the valvular action of the larynx relative to supra- 
glottal articulation. Certain nonlaryngeal features accompany laryngeal timing, 
but it remains to be determined whether any of them are controlled by the same 
mechanism. Laryngeal timing underlies a complex set of interrelated acoustic 
features any one of which may have perceptual efficacy. The total set varying 
rather predictably with changes in laryngeal timing has differentiating power 
in speech perception. The focus of attention for many years on utterance-initial 
position, reflected in the widely used term Voice Onset Time (VOT), seems to
have led some investigators to fail to understand VOT and its acoustic complex- 
ity as a positional manifestation of the more general phenomenon of laryngeal 
timing. 
Phonetic Aspects of Time and Timing*
Leigh Lisker^ 



ABSTRACT 

By a definition narrow enough to exclude acoustic and physiolog- 
ical aspects of speech behavior, phonetics is reduced to the descrip- 
tive practice of linguists, whose judgments on the physical nature of 
a speech signal are primarily auditory and sympathetic proprioceptive. 
These judgments are for the most part embodied in a special alphabet 
of indeterminate size, each element of which is defined with refer- 
ence to some particular state of the vocal apparatus. In general, a 
dimension of time is not included in the set of auditory and articu-
latory properties by which the different states are specified. Since,
in all but a negligible number of cases, speech signals are said to 
not involve a single state of the vocal apparatus, but rather a se- 
quence of such states, this sequential ordering is explicit recogni- 
tion of a temporal dimension. But the time-ordered elements are 
themselves "timeless" unless the linguist determines that varying the 
duration of one or more of them serves to signal a semantic — that is, 
a linguistic — distinction. At this point, one of the two segments 
said to differ significantly in duration will often be judged to have 
a duration "inherently" determined by its other properties, while the 
other will be characterized as "long" or even "overlong." Aside from 
duration as a property ascribed to the segments constituting a speech 
signal, there are temporal aspects of speech that are less often 
given an explicit representation in the linguist's transcription; 
these are at best indirectly indicated by the so-called "junctural" 
marks and stress markers. One temporal aspect of speech that is reg-
ularly ignored is the feature of rate of articulation, for within cer-
tain ill-defined limits speech tempo is ad libitum.

Let me begin with a preamble to explain my understanding of "phonetic as- 
pects of time and timing," in the present context. That understanding is to a 
considerable degree determined by a factor that is itself temporal, or at least 
temporal at one remove. I have in mind the spacial arrangement of the discus- 
sion titles on the program I was given, and my belief that it follows the con- 
ventions of written English and thus signals our chairman's wish that I speak 



*This paper was presented by invitation of the 100th meeting of the American 
Speech and Hearing Association, Washington, D.C., 21-24 November 1975. 
first, with Katherine Harris, Dennis Klatt, and Peter MacNeilage to follow in
that order. Therefore, I supposed, in preparing for this morning's business, 
that our discussion of the temporal organization of speech activity would begin 
with a consideration of certain aspects to be called "phonetic" and then go on 
to physiological and acoustical data and theories bearing on our topic. This 
order seems to imply that physiological and acoustical data comprise what are
in some sense nonphonetic aspects of the speech process, and while I prefer to
think of phonetics as deserving a much broader definition, it is both convenient 
here, and regrettably close to the general practice of some language scholars in 
discussing language behavior, to restrict the scope of my contribution so as to 
exclude in particular the subjects which Katherine Harris and Dennis Klatt will 
be addressing. That leaves the perceptual aspect for me to talk about — itself 
a broad enough subject to include a good deal that many of us might want to 
exclude from phonetics. 

Since the scholarly caste that has for longest concerned itself with speech
is the one of linguists, even though some who are called linguists would deny 
the study of speech activity a place in their discipline, I will for now take as 
reports on the phonetic aspects of speech timing those observations on the tem- 
poral properties of speech that linguists include, more or less systematically, 
in their language descriptions. Such observations, like most others referring
to physical properties of a language in its speech guise, reflect judgments by 
the observing linguist as to the physical attributes characterizing speech 
events as readings of particular strings of linguistic items. Physical proper- 
ties are most often defined in articulatory terms, sometimes in acoustic, but 
the judgments are almost entirely based on auditory input without overt reliance 
on any observational data obtained under laboratory conditions. That linguists'
phonetic judgments are to an extent based on such data seems undeniable, but it 
is not usual to find them informed by a knowledge of the latest laboratory find- 
ings. This is understandable when we remember that many linguists are not pri- 
marily interested in precise physical descriptions, but rather in devising 
spelling systems that meet certain criteria, only one of which is that its 
letters bear a statable relation to physically describable aspects of the
classes of speech signals they are designed to represent. 

However, this does not make the linguist's phonetic transcription a fully
explicit physical description. First of all, it represents speech by a linear
array of discrete letters, so that as description it misrepresents speech in a 
serious way. Second, the physical properties represented by the transcription 
are primarily those to which distinctive function is attributed by the linguist; 
if some others are represented as well, this is, as Bloomfield (1933) put it,
"due merely to chance observations...by an observer with a good ear," exercising
a skill of "little scientific value." The linguist's representation embodies a 
partial physical description, but despite a possible implication of Bloomfield's 
comment, the disclaimer of completeness is no gesture of modesty. The linguist 
claims to know, from observing speech behavior in the interview situation, just 
what it is in the physical signal that the native speaker-hearer must produce and 
attend to in order that the signal be correctly interpreted. The incompleteness
of the physical specification is dictated by the linguist's assertion that not 
all features of the signal and the signal-generating activity are linguistically 
significant, and that the linguist's technique of observation and analysis suf- 
fices to identify those features that are. In one undoubtedly influential view, 
that of Chomsky and Halle (1968), it is asserted that it is indeed linguistically 
irrelevant whether the linguist's phonetic statements correspond to physically 
attested fact; in their Sound Pattern of English a hypothetical speaker-hearer
is invoked whose beliefs concerning the phonetic properties of his language are
what a phonetic transcription should represent. We may be permitted to note,
though only in passing, that the ideal speaker-hearer whose phonetic intuitions
are to be represented seems to be very well aware of the acoustic and physio-
logical studies to be found in MIT's Quarterly Progress Reports, and may even
bear a suspicious resemblance to one of the authors of The Sound Pattern of
English. It might well be the case, therefore, that the phonetic notions of
this speaker-hearer are not immutable. 

At the heart of the linguist's practice of phonetic transcription, and 
serving as the principal bearer of his phonetic judgments or assessments of the 
ideal speaker-hearer's intuitions, is an alphabet of unknown but possibly finite 
size, to each letter of which is assigned a function as the referential of a
physically defined set of vocal-tract configurations or its acoustic conse-
quences. In defining the value of each letter of this alphabet, reference is
made to a smaller set of parameters by which the state of the vocal tract is to
some degree specified. In using this alphabet to spell a speech signal, succes-
sive vocal-tract configurations are identified and appropriate letters are
arranged in a linear left-to-right order, which corresponds to the temporal
order of the observed vocal-tract states. Each state represented has a temporal
order relation to every other state represented by the letter sequence, and the
expression of this temporal relation is obligatory. This is trivially so be-
cause the only allowed spacial relation between letters is either left or
right placement. Despite the fact that the letters stand for incomplete speci-
fications of vocal-tract shape, no two of them may be simultaneously applicable,
each being appropriate for a unique and unspecified time interval. Or, if you
prefer, the duration is specified as being equal to that of one "segment," the
duration of which is not further specified. Presumably each vocal-tract state
represented is maintained over the duration of the segment, though it is not
clear that this is necessarily the claim in all cases. Only the order of suc-
cession of the different states is represented—of necessity—with one segment
succeeding another without overlap and without the intervention between any two
immediate neighbors of a third requiring representation on linguistic grounds.

In addition to the letters that represent temporal segments, there are 
others that have, along with grammatical and intonational meanings, some sig-
nificance as temporal markers. These are the several so-called juncture signs,
as well as those indicating levels of stress. The juncture marks, which corre- 
spond very roughly to word-space and the punctuation marks of standard ortho- 
graphy, indicate places in the temporal sequence where, together with other 
phenomena, there may also occur ritardandos and even brief pauses, especially if 
they coincide with certain grammatical boundaries. But none of these so-called
suprasegmental indicators is exclusively or even primarily temporal in refer- 
ence, and demonstrations of the need to employ them in phonetic transcription 
generally focus on variations in pitch and loudness. Marks for stress, which 
for many linguists mean relative loudness, also have secondary temporal meaning; 
the presence of a mark of high stress usually can be taken to imply a local in-
crease in segment durations, and, at least for English, the intervals between
successive high stresses in a speech stretch are said to be of roughly constant
duration. Thus, the placement of high stress marks may be said to govern the
relative tempo with which the segments are produced within the utterance, in the 
same way that the vertical lines marking off the measures of musical notation 
tell us that all the notes of one measure are to be performed within a time span
equal to that occupied by all the notes within any other measure in the same
text. Modern musical notation is more explicit on the matter of timing, of
course, and stress placement in musical performance is not rigidly tied to the
measure, but some phoneticians occasionally make use of the musical measure to 
represent temporal regularities observed in speech. In the view of many lin- 
guists, however, such regularities are not distinctive in language, and hence 
have no place in a phonetic transcription, however important they might be for 
the global characterization of the phonetic properties of speech generally, or 
of one language as against others. Except insofar as juncture and stress mark- 
ers provide some guidance to tempo, the task of performing a piece of phonetic 
transcription is very like that of the musician asked to sight-read an unfamil- 
iar piece from a medieval neumatic score, which indicates nothing of the indiv- 
idual notes but their relative pitch and sequencing. The lack of explicit tim- 
ing information or instructions has its advantages for both kinds of performance, 
allowing scope for individual variety of expression. For the performers of 
speech and music the freedom of choice implied by the notations is probably wide 
enough to permit readings of the same score that are different enough to convey 
different messages to a listener. For the musician a notation that fails to 
specify segment duration allows one kind of temporal latitude if the musician is 
a flutist—the segments can be given durations at will. For the slide-trombonist
segment durations are also ad libitum, and there is the additional freedom to 
determine how rapidly to shift from one pitch to the next in glissando playing. 
Producing speech is more like playing the trombone than the flute, and phonetic 
transcription does not prescribe how rapidly the shift from one vocal-tract state 
to the next is to be accomplished. I have probably pushed the analogy much too
far, for it is fair to object that musical notation is a set of instructions for 
performance more than a description, while phonetic transcription is more a de-
scription than a performance. As a set of instructions, the phonetic transcrip-
tion will have an adequacy that depends, I suspect, less on its degree of spec-
ificity than on whether or not the "score" it presents is familiar to the read-
er. Even if the score as a whole is novel, it must be made up of parts that are
familiar if it is to be performed correctly. At the very least, the reader must 
be a practiced producer of fluent speech in order to implement the score as in- 
tended by the transcriber. 

As a model of speech, the linguist's graphical representation suffers from 
inadequacies that are well-known: a speech signal does not consist of a se-
quence of sounds, each fixed for some unspecified duration and separated from
its closest neighbors by intervals of near-zero duration; but it is perhaps un- 
fair to charge the linguist with responsibility for such a model merely because 
his transcription practice seems to presuppose it. In fact, likely enough, the 
linguist is well aware that it is wide of the mark, and is only too ready to 
accept the contrary view of speech as a process, everywhere continuous, which 
possesses no properties that provide a physical basis for segmentation. The 
static definitions of vocal-tract states that he provides as interpretations of 
the transcription represent, then, outputs of a particular kind of sampling of 
this continuous signal, where the number of sampling points is determined by the 
number of perceived "change points" in the signal, but is pretty much indepen- 
dent of duration. In short, the transcription is the output of a special kind 
of "A to D" converter whose sampling rate is not temporally specified, the 
interval from one sampling point to the next depending roughly on when a per- 
ceived change in signal quality comes along. Instead of supposing the speech 
signal to consist of a succession of states, each maintained for some finite
time interval corresponding to the segment, one can instead say that each segment 
or letter of the transcription represents a state of the vocal tract which must 
be achieved or approximated within some time interval, and that this interval, 
though not necessarily the state which characterizes it, has both a finite dura- 
tion and a specified place in the temporal sequence of states. The duration 
over which the state characterizing the segment is maintained may or may not be
as great as the total duration of the segment, whatever that might be defined to 
be; the linguist as auditor will order segments with respect to duration, inde- 
pendently of what the laboratory phonetician may say about the duration over 
which the specified vocal-tract state is maintained. Because in fluent speech 
it is not unusual for that duration to be close to zero, it seems clear that we 
cannot hope to account for the linguist's judgments (and those of the rest of us 
as well) of segment duration simply by measuring the durations of steady-state 
intervals that might be discovered here and there in the speech signal. Since a 
good deal of the literature on speech timing is devoted to reporting durations 
of phonetic segments — when, in fact, what is being talked about are durations 
measured between acoustically specified change-points in the speech signals — a 
close relation between these measurements and the listener's judgments of seg- 
ment duration must be established before those measurements can be claimed to 
reflect phonetic aspects of speech activity, at least in the narrow definition 
of phonetics I am assuming at the moment. In other words, before we can justify
referring to durations between physically specified events as equivalent to vow- 
el durations, for example, we must do what the psychophysicists did to establish 
the nature of the relation between pitch and fundamental frequency or to connect 
loudness with sound pressure level and frequency. In short, we must confront our 
old friend the segmentation problem. As we know, this is not so much a question 
of how to segment a signal, which is everywhere continuous, but rather where to
cut the signal, amply possessed of discontinuities, so that the pieces derived 
can be claimed to correspond reasonably to the listener's segments. 
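
The "A to D" converter described above, whose sampling is triggered by a perceived change rather than by a clock, can be illustrated with a minimal sketch. The function name, the numeric signal, and the threshold are all hypothetical; a single number per time step stands in for "signal quality", which is a drastic simplification of what a listener does.

```python
def change_point_samples(signal, threshold):
    """Sample a signal only where it has moved at least `threshold`
    away from the last sampled value (event-driven, not clock-driven)."""
    samples = []
    last = None
    for index, value in enumerate(signal):
        # The first value is always sampled; later values are sampled
        # only when a sufficiently large change is detected.
        if last is None or abs(value - last) >= threshold:
            samples.append((index, value))
            last = value
    return samples


# Hypothetical signal: three steady stretches of unequal length.
signal = [0, 0, 0, 5, 5, 5, 1, 1]
print(change_point_samples(signal, threshold=2))
```

The number of samples depends on the number of changes and not on how long each stretch lasts, which is the property the analogy relies on.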

Let us look now at the linguist's representation of speech as a sequence of 
segments, defined by reference to states either aimed at or manifested by the 
vocal tract or the homunculus that runs it, with segment durations not specified. 
This reticence as to the temporal dimension of the segment is tacit admission of 
the freedom to perform what is linguistically the same speech piece with tempos 
varying over a considerable range; moreover, no claim is made that the relative 
durations of segments are constant with changes of tempo (Gaitenby, 1965). But 
is it in fact true that relative duration is never specified by the linguist's 
description? Of course not. In the description of some languages, the linguist 
finds it useful to distinguish members of a particular phonetic class with
respect to a temporal dimension; for example, Thai is said to distinguish between
short and long vowels (Abramson, 1962); for Estonian both vowels and stop con- 
sonants come in three grades of duration (Lehiste, 1970); English vowels are 
either short and lax or long and tense. Where a difference in length is consid- 
ered to be distinctive, the linguist may elect to represent the longer of a pair as 
a sequence of two like segments, thus by implication recognizing that a single seg- 
ment possesses one unit of duration. Sometimes, however, a special kind of seg- 
ment is devised, whose only characteristic is that it has the length of one seg- 
ment, all other properties being given by the specification of an immediate neigh- 
bor, usually the one directly preceding it. The long vowel or consonant involves, 
then, a particular kind of reduplication. Whether the observation that a
particular vocal-tract state is maintained sometimes for a longer interval and
sometimes for a shorter one is to be represented by one spelling device or
another has most often been decided by criteria not primarily phonetic in nature.
(Sometimes there may be some phonetic basis for asserting that the extra-long
duration of an articulatory position must be analyzed as a sequence of repeated 
gestures.) 

Uncertainty as to the number of segments over which a single position is 
maintained is not the only problem encountered in dealing with a temporal dimen- 
sion at the segmental level; the same uncertainty may arise in connection with 
the evaluation of what are clearly recognizable sequences at one level, at
least, of phonetic description. The notorious example of this is the case of
stop-fricative sequences that may be accorded the status of one-segment-long
affricates if there seem to be strong phonotactic (that is, distributional)
reasons to do so, in which case an especially close temporal relation between
its sequential components may also be discovered. There are other such examples: Are
the so-called "prenasalized" stops of certain West African languages "really"
one or two segments? Are the Russian palatalized consonants sequences of 
consonant and /y/? Sometimes, but apparently not very often, there is a genuine 
convergence of phonetic and phonotactic considerations; in Polish two linguisti- 
cally distinct stop-fricative sequences differ phonetically in ways that allow 
some justification for calling one of them a single segment and the other a 
sequence. Thus it would appear that recognizably different vocal-tract states 
in immediate succession are not invariably allotted to two segments with only 
one possible temporal relation; that relation may be characterized as one of
"close" or "open" transition, or, in the case of vowel-consonant sequences,
"close" versus "loose nexus." Now perhaps we should be inclined to look for and
find differences in degree of coarticulation to support a particular answer to 
the question of "one segment or two?" 

Apart from cases where the linguist is forced to recognize a temporal fea- 
ture because it appears to play a linguistically distinctive role quite like the 
features by which vocal-tract shape is specified, there are occasions where con- 
textually conditioned variations in segment duration are recognized. The greater 
duration of vowels preceding voiced stops is marked in phonetic transcription, 
but that added duration (as compared with the durations of the same vowels before 
voiceless consonants) is not said to constitute another segment, and both the 
linguist and the phonetician are motivated to discover some basis, phonetic in 
the broad sense, for considering it to be a consequence of coarticulation. 
Similarly, the durational difference between the English vowels /ɪ, ʊ/ and /i, u/
is ascribed to the laxness of the first pair as contrasted with the tenseness of 
the second. Similarly, the brevity of the apical flap of American English is a 
consequence of the small force of articulation exerted in its production. Some 
kinds of temporal variation at the level of the segment that have been reported 
appear to have escaped attention; for example, the greater durations of initial 
fricatives as compared to final, or the greater durations of final nasals as
compared to initial. 

Observations of this last kind, which relate relative duration to position
within the segment sequence, are, in effect, assertions that there must be
postulated units larger than the individual segment for which temporal regularities
may be stated. The smallest of these is the syllable (only phoneticians who
look at physiological and acoustic data worry about the organization of
consonant-vowel and vowel-consonant sequences), a unit whose usefulness in phonetic
description is acknowledged in the same measure as its resistance to definition
is deplored. Linguists tend to solve this problem by believing in the syllable
as a phonotactic unit with no phonetic standing, while phoneticians incline to
describe it as the basic element of speech organization. In this view the
positional variation to which a phonetic unit conforms is, first of all, that of
position within the syllable. In fact, it would seem that the durations of the 
segments composing a single syllable are changed sufficiently from their hypoth- 
esized "inherent" durations for the syllable to be the elementary temporal unit, 
with "inherent" or baseline durations assigned to segments defined purely with 
respect to their status within this unit. This, for the linguist, is so far 
from being a controversial statement that I think I might justly be charged with 
beating a horse that was stillborn; no linguist's discussion of speech timing 
has ever proposed a direct relation between utterance duration and the number of 
segments composing it. But a fairly direct relation between duration and sylla- 
ble number appears to be intuitively acceptable, whether or not linguists make 
an explicit statement on the matter. The acceptance of this relation underlies 
the practice of defining the somewhat elusive quality of speech that we presume 
is chiefly temporal in nature, namely speech tempo, as corresponding more to a 
measure of syllables than to segments per unit time. 

The same belief — I would suppose — underlies the distinction made between 
languages like Spanish, which exhibit this feature of "syllable timing," and 
languages such as English, whose contrary tendency to "stress timing" seems to 
require explanation as a departure from the expected. In English, we are told, 
the constant duration intervals into which an utterance can be analyzed are 
marked by stress (as has already been mentioned). In effect, this says that the 
durations of utterances are determined by syllable count, but not all syllables 
count. So far as I know, however, no one has proposed that speech tempo for 
English be equated with a measure of the duration separating adjacent stressed 
syllables. What has sometimes been reported (and this makes such a measure less 
appealing) is that syllables that are stressed at one tempo may be produced with 
noticeably reduced stress when tempo is increased, suggesting a tendency to keep 
interstress durations constant over a range of tempos. Of course, with all the 
importance that has been ascribed to the syllable as a unit of speech organiza- 
tion, both in production and in perception, it is remarkable that the linguist's 
writing system fails to represent this unit any more directly than do the more 
widely known alphabetic orthographies (and I suspect it would create at least as 
large a class of problem readers if put to more general use). Perhaps the fact
that linguists have followed the alphabetic rather than the syllabic model in 
their writing practice comes from the general exclusion of temporal aspects in 
specifying speech, but it does seem odd, nevertheless, that a fundamental unit 
is not explicitly represented. 

We come, finally, to the aspect of speech referred to as its "rhythmic" 
quality, which everyone seems to agree is an all-pervasive feature. From time 
to time linguists appeal to rhythm as a factor that determines stress placement 
in the case of, for example, lexical items like fourteenth, whose stress contour
is variable with context; and linguists have sometimes, for example,
characterized languages as "machine-gun-like" in effect. The basis for this conviction
that we all share — namely, that speech can be described as rhythmic and that it 
is profitable to discuss the temporal organization of the process without first 
deciding whether any exists — doesn't seem so obvious as to be undeserving of a 
final remark. It is this: if all speech is rhythmic, it is certainly true that
some speech is more obviously rhythmic than other, the most well-regulated speech
being the perfectly metrical performance of a child's chant or poetry reading.
If prose speech differs from these in temporal organization, is the difference 
one of kind or degree of regularity in timing? The poet's art bends speech to
aesthetic effect; some of the stuff it fashions probably is unchanged from nature
(that is, the phoneme stock), but some is creative transformation or even possi- 
bly additive. How much of our conviction that rhythm is a characteristic of 
natural speech represents a metricization, and how much a metrification, of the
object of all our attention? I wish I might end on the note of that ringing 
question, but more soberly suppose that we shall learn, from studies that examine 
data and not just the sometimes stray observations of the linguist, that speech
activity may be described as at least, or at best, "quasi-rhythmic" in nature.

Static and Dynamic Acoustic Cues in Distinctive Tones*
Arthur S. Abramson+



ABSTRACT 

It is conventional to classify phonemic tones into dynamic or 
contour tones and static or level tones. The perceptual relevance 
of this impressionistic dichotomy is considered here for Central 
Thai, which has two dynamic tones (falling and rising pitches) and 
three static tones (high, mid, and low). A fundamental-frequency
range appropriate to an adult male voice was used to synthesize
three series of tonal variants on a syllable type available for five
tonally differentiated words: (1) sixteen F0 levels at intervals of
4 Hz, (2) sixteen F0 movements from a mid origin to end points
ranging from top to bottom of the range in steps of 4 Hz, and (3)
seventeen variants rising from the bottom to end points from top to bottom
in steps of 4 Hz. The stimuli were played to native speakers for
identification. The results indicate that level variants contain
sufficient cues for identification as static tones but with
considerable overlap. Identification, however, is enhanced by slow F0
movement. Rapid F0 movement is required for dynamic tones. Although
imprecise, the typological dichotomy is useful. 

In a tone language, part of the specification of each morpheme or word is a 
distinctive pitch pattern. Although some tones may have additional phonetic 
features, the major characteristics of a tone system are fundamental-frequency
states and movements. 

Some linguists refer to level tones, which are heard as having no pitch 
movement, and gliding tones which audibly rise or fall (Pike, 1948). In
phonological analysis, the question may arise as to whether glides should be
treated simply as whole pitch movements or as movements between level tones that
are otherwise present in the system (Gandour, 1975). Here I am more interested
in the validity or usefulness of the distinction between gliding or dynamic
tones and level or static tones. The question is examined in Thai, the official
language of Thailand.

*This is a slightly revised version of a paper presented at the 91st meeting of
the Acoustical Society of America, Washington, D.C., 4-9 April 1976.

+Also University of Connecticut, Storrs.

Figure 1: Average F0 contours of the tones of Thai on long vowels (from
Abramson, 1962: Figure 3.6). (Horizontal axis: duration.)

Some years ago I published typical fundamental-frequency contours of the
five tones of Thai, as shown in Figure 1 (Abramson, 1962). The tones are
conventionally labeled from top to bottom: high, falling, mid, rising, and low.
Perceptual experiments with synthetic speech showed that these contours carried 
sufficient information for high intelligibility in the labeling of monosyllabic 
words. More recent experiments with the present formant synthesizer of Haskins
Laboratories have again demonstrated the sufficiency of these contours (Abramson,
1975a). Moreover, these findings provide a baseline for the experiments to be 
discussed here. Note that all the tones show at least some movement. The only 
one that may really be level is the mid tone because its final drop appears to 
be an intonational phenomenon before a pause.² In the experiments, I sought a
basis for a division between dynamic and static tones in these curves. The
falling and rising tones with their abrupt changes in frequency showed considerably
more movement than the others. I labeled the falling and rising tones dynamic
and the high, mid, and low tones, static.

²Speakers of Thai may find a prepausal mid tone without a final drop abnormal,
but they identify such a contour nearly as well as the normal one (Abramson,
1975a).

In the past few years, further acoustic analysis of the Thai tones 
(Erickson, 1974, 1976; Abramson, 1975b) has suggested that, especially in running 
speech, the static tones are not very different from the dynamic tones. The
high tone can be described as a high rising tone, while the rising tone can be 
described as a low rising tone. The low tone tends to fall to the bottom of the 
speaker's voice range and stay there, although this fall starts at a somewhat 
lower point than that of the falling tone. It is only the mid tone that does not
make extreme excursions into the high and low regions of the voice range, al- 
though it seldom has the ideal level shape of Figure 1. The following three 
experiments are intended to shed light on the perceptual validity of the dis- 
tinction between static and dynamic tones. 

A syllable of the type [kha:] was prepared on the Haskins Laboratories for- 
mant synthesizer. Sixteen variants were made by superimposing sixteen level 
fundamental-frequency trajectories ranging from 152 Hz down to 92 Hz in steps of 
4 Hz. Each stimulus had a flat amplitude except for a slight rise at the
beginning and a slight fall at the end. In Test 1, these were played in several
randomizations to 37 native speakers of Thai for identification as one of five
possible words. The question considered was the following: Do fundamental-frequency
levels carry enough information for identification of the static tones,
or must there be some movement for acceptability? The results in Figure 2 show
that only the three static tones are used as response categories. Note that
nowhere is 100 percent identification reached. A peak of 90 percent for the low
tone is about the same as the peak shown in the baseline test (Abramson, 1975a) by 
the same subjects for the typical low tone displayed in Figure 1. The high tone 
at the left reaches a peak of only 88 percent compared with 98 percent for the
typical high tone in the baseline test. The mid tone in the middle reaches 73 
percent as compared with 82 percent in the baseline test. 
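
The Test 1 series described above (sixteen level trajectories from 152 Hz down to 92 Hz in 4 Hz steps) can be enumerated with a short sketch. The function and parameter names are illustrative only; nothing here reproduces the synthesizer control actually used.

```python
def f0_levels(top_hz=152, bottom_hz=92, step_hz=4):
    """List the level F0 values of the Test 1 continuum, highest first."""
    # range() excludes its stop value, so stop one below the bottom level.
    return list(range(top_hz, bottom_hz - 1, -step_hz))


levels = f0_levels()
print(len(levels), levels[0], levels[-1])
```

A span of 60 Hz divided into 4 Hz steps gives 15 intervals and therefore the sixteen variants reported in the text.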

It is also true that all three tones elicit responses throughout the range. 
Most of the latter effect was caused by three subjects who used only two label- 
ing categories, high and low or mid and low. Even in isolated monosyllables, 
then, flat fundamental-frequency trajectories can elicit static-tone responses. 
For this to happen in natural speech, there must be some auditory accommodation 
to the speaker's pitch range as well as to the immediate tonal context. At the 
time of this test, the subjects had become used to the voice and frequency range 
of the synthesizer. Lack of F0 movement did cause some confusion for the
subjects,³ and for three of them it was rather disrupting. It is not surprising
that the dynamic tones were not used as response categories. 

³The capable and efficient selection and supervision of the test subjects by
Miss Panit Chotibut of the Faculty of Humanities, Ramkhamhaeng University, is
much appreciated. The subjects were college students who were native speakers
of the Central Thai dialect of Bangkok and its environs.

Figure 2: Identification functions for fundamental-frequency levels as static
tones. (Panel title: Test 1: F0 Levels; horizontal axis: frequency in Hz.)

In Figure 3 we see the tonal variants used in Test 2. They all start from
a common mid origin and end at the same points as in Test 1. I wondered whether
the static-tone responses would be increased by the moderate amount of movement
in most of these variants and, at the same time, whether at least the extreme
values in the continuum would yield mainly dynamic responses. These stimuli
were played to 31 of the original subjects, and the results are shown in 
Figure 4. A few stimuli at either end do indeed yield dynamic responses, but no 
greater than a peak of almost 14 percent for the rising tone at the high end, 
and almost 5 percent for the falling tone at the low end. Otherwise, the static 
tones are again the predominant responses. Except for the low tone, there is 
somewhat better labeling here. The high tone goes from 88 percent in Test 1 to 
94 percent in Test 2, and the mid tone improves from 73 percent to 84 percent. 
In fact, it is a slightly downward movement from 120 to 116 Hz that yields 84 
percent,⁴ while the flat variant at 120 Hz yields only 72 percent.⁵ It seems
safe to say that fundamental-frequency movements increase the acceptability of 
synthesized syllables as static tones. For the low tone, a more appropriate 
movement would start somewhat lower in the voice range. 

⁴This should be compared with the 82 percent for the mid tone of the baseline
test (Abramson, 1975a). That stimulus did not slope downward from its onset as
does the one described for Test 2 here, but it did have a final drop.

⁵Compare it with the flat variant at 120 Hz in Test 1, which yielded 73 percent.

Figure 3: Fundamental-frequency contours from a mid origin. (Panel title:
Test 2: F0 Contours.)

Figure 4: Identification functions for the contours of Figure 3. (Panel title:
Test 2: F0 Slopes from Mid Origin; horizontal axis: frequency in Hz.)

Figure 5: Fundamental-frequency contours from a low origin. (Panel title:
Test 3: F0 Contours; horizontal axis: duration in msec.)

In Figure 5 we see the variants for Test 3. All the variants start from a
low origin at 90 Hz and reach the same end points as before except for a flat
variant ending at 90 Hz. In the test these 17 stimuli were played to the 
31 subjects for identification. It was expected that the sharply rising variants 
would be heard as a dynamic tone, namely the rising tone, with the others divided 
among the static tones with some preference for the low tone. The results are 
shown in Figure 6. With a peak at 91 percent, the rising tone is clearly favored. 
The low tone reaches a peak of 88 percent only at the very bottom of the range. 
It would be more convincing if it started higher and drifted downward. The 
third response category is the high tone, which peaks at 38 percent. For this
tone, a more appropriate movement would start higher. The mid tone, which peaks
at just under 12 percent, is negligible. 
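
The two glide series of Tests 2 and 3 can be enumerated in the same spirit. In this sketch the 90 Hz low origin and the end points are taken from the text; the 120 Hz mid origin is an inference from the "120 to 116 Hz" movement and the flat 120 Hz variant mentioned for Test 2, not a value stated outright. All names are hypothetical, and the shape of each contour between onset and offset is not modeled.

```python
# End points shared with the Test 1 levels: 152 Hz down to 92 Hz in 4 Hz steps.
END_POINTS_HZ = list(range(152, 91, -4))


def glide_series(origin_hz, end_points_hz, add_flat=False):
    """Return (onset, offset, net change) in Hz for each variant."""
    ends = list(end_points_hz)
    # Test 3 adds a flat variant ending at the origin itself.
    if add_flat and origin_hz not in ends:
        ends.append(origin_hz)
    return [(origin_hz, end, end - origin_hz) for end in ends]


test2 = glide_series(120, END_POINTS_HZ)                # assumed mid origin
test3 = glide_series(90, END_POINTS_HZ, add_flat=True)  # low origin from the text
print(len(test2), len(test3))
```

Under these assumptions Test 2 has sixteen variants, one of them flat at 120 Hz, and Test 3 has seventeen, which agrees with the counts given in the text.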

We may conclude that fundamental-frequency levels do carry much information 
on the static tones, although they improve with movement. For the dynamic tones, 
as exemplified here by the rising tone, a rather abrupt movement is required. 
Other continua that bear on this question have been tested but are not yet ready
for presentation. Although the dichotomy between static and dynamic tones is
imprecise and unstable, more so in production (Abramson, 1975b) than perception,
it is still useful as a rough classification of tone production and as an index
to the types of acoustic cues used in recognition of tones.

Figure 6: Identification functions for the contours of Figure 5. (Panel title:
Test 3: Slopes from Low Origin; horizontal axis: frequency in Hz.)

For a reason that is hard to reconstruct, possibly no more than an oversight,
the low point was set at 90 Hz instead of 92 Hz as in Test 2. It is not likely
that the downward shift of 2 Hz has any bearing on the outcome.

The Effects of Selective Adaptation on Voicing in Thai and English
S. L. Donald+



ABSTRACT 

Native Thai speakers and native English speakers took part in a 
selective adaptation experiment. The stimuli were a labial series of 
25 stimuli from a voice-onset-time continuum. This series spanned 
three phonological categories for the Thai-speaking subjects but only 
two categories for the English-speaking subjects. The data suggest 
that three feature detectors mediate the perception of voicing con- 
trasts for the Thai-speaking subjects, whereas only two feature de- 
tectors appear to be active in the English-speaking subjects' percep- 
tion of voicing contrasts. Implications of this difference are con- 
sidered. 

Several selective adaptation experiments have examined the perception of 
the voicing distinction (for example, Eimas and Corbit, 1973; Cooper, 1974). 
The variable in the experiments was voice onset time (VOT), or the interval
between stop release and the onset of phonation (Lisker and Abramson, 1964). 
Eimas and Corbit (1973), as well as later investigators, have suggested that 
two feature detectors mediate the perception of voicing contrasts. Since native 
English speakers were used, these adaptation experiments were limited to the 
distinction between voiced stops and voiceless aspirated stops, with a continuum 
being tested ranging from 0- to 80-msec VOT. These detectors are hypothesized 
to respond to a slightly overlapping range of VOT values. The category boundary 
lies at the point at which both detectors respond with equal strength. Repeti- 
tive stimulation of either detector is said to fatigue that detector, resulting 
in weakened output. The unadapted detector will thus respond to boundary stimuli 
with relatively greater strength than the adapted detector, resulting in a 
shift in the phonetic boundary. 
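
The two-detector account just summarized can be made concrete with a toy model. Everything quantitative below is an assumption made for illustration: the logistic response curves, the slope of 10 msec, and the fatigue gain of 0.8 are hypothetical, and only the unadapted boundary of +25 msec echoes the English value cited in this paper. It is a sketch of the reasoning, not the model fitted by Eimas and Corbit.

```python
import math


def detector_outputs(vot_msec, center=25.0, slope=10.0,
                     voiced_gain=1.0, voiceless_gain=1.0):
    """Outputs of two overlapping detectors; a gain below 1 models fatigue."""
    z = (vot_msec - center) / slope
    voiced = voiced_gain / (1.0 + math.exp(z))          # strong at short VOT
    voiceless = voiceless_gain / (1.0 + math.exp(-z))   # strong at long VOT
    return voiced, voiceless


def boundary(lo=-100.0, hi=150.0, **kwargs):
    """Find by bisection the VOT at which both detectors respond equally."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        voiced, voiceless = detector_outputs(mid, **kwargs)
        if voiced > voiceless:
            lo = mid   # still on the voiced side of the crossing
        else:
            hi = mid
    return (lo + hi) / 2.0


print(round(boundary(), 2))                      # unadapted
print(round(boundary(voiced_gain=0.8), 2))       # voiced detector fatigued
print(round(boundary(voiceless_gain=0.8), 2))    # voiceless detector fatigued
```

Fatiguing the voiced detector moves the crossing toward shorter VOT, so fewer stimuli are labeled voiced; fatiguing the voiceless detector moves it the other way. That is the direction of shift the hypothesis predicts.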

Thai, in contrast to English, has three voicing categories: prevoiced, 
voiceless unaspirated, and voiceless aspirated stops. (This is true for the 
labial and alveolar places of articulation. Thai lacks a prevoiced velar stop.) 
Abramson and Lisker (1965) found the category boundaries here to occur at -20 
msec and at +40 msec in comparison to the single English boundary at +25 msec.



+Also University of Connecticut, Storrs.

Following Eimas and Corbit's reasoning, three feature detectors might
operate in the perception of three voicing categories, as found in languages like
Thai. The purpose of the first experiment reported here is to investigate this
issue. Native speakers of Thai were used as subjects. Three adaptation
conditions were presented: adaptation to a prevoiced stimulus, adaptation to a
voiceless unaspirated stimulus, and adaptation to a voiceless aspirated stimulus.
The VOT continuum examined here ranged from -80 to +70 msec. 

In a second experiment, native English speakers responded to the same
stimuli with only two, rather than three, voicing labels. The purpose of this
experiment was to examine the effect of linguistic experience by comparing the effect
of adaptation with the two stimuli the English speakers labeled as voiced, with
the effects these stimuli produced on Thai speakers for whom they were
phonologically distinct.

EXPERIMENT I 

Subjects 

Five native speakers of Central Thai (Siamese) served as subjects.

Stimuli

The stimuli were a labial series of 25 stimuli from a VOT continuum
prepared by Lisker and Abramson (1970). The variations in VOT were produced by
varying the onset of the first formant relative to the onset of the second and
third formants. During the absence of the first formant, the upper formants
are excited by a noise source rather than by a periodic source. The VOT values
ranged from -80 to +70 msec in 5 and 10 msec steps. Table 1 contains a list of
the values of these stimuli. The adapting stimuli were -80 msec for the
prevoiced adaptation condition, +5 msec for the voiceless unaspirated condition, and
+70 msec for the voiceless aspirate condition. 
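
The mixed 5 and 10 msec spacing of the continuum is easy to misread, so the following sketch rebuilds the 25 values and the three adaptor positions from the stimulus list in Table 1. The function and dictionary names are hypothetical; the values are those given in the text and table.

```python
def vot_continuum():
    """VOT values in msec, in stimulus order 0 through 24."""
    values = list(range(-80, -50, 10))    # -80, -70, -60 (10 msec steps)
    values += list(range(-50, -20, 5))    # -50 through -25 (5 msec steps)
    values += [-20, -10]                  # one 10 msec step
    values += list(range(-5, 51, 5))      # -5 through +50 (5 msec steps)
    values += [60, 70]                    # 10 msec steps
    return values


# Stimulus numbers marked as adaptors in Table 1.
ADAPTORS = {"prevoiced": 0, "voiceless unaspirated": 13, "voiceless aspirated": 24}

vots = vot_continuum()
for label, stimulus in ADAPTORS.items():
    print(label, vots[stimulus])
```

The finer 5 msec steps fall in the regions where the Thai and English category boundaries are expected.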



TABLE 1: Stimulus values.

Stimulus    VOT value       Stimulus    VOT value
   0*       -80 msec           13*        5 msec
   1        -70 msec           14        10 msec
   2        -60 msec           15        15 msec
   3        -50 msec           16        20 msec
   4        -45 msec           17        25 msec
   5        -40 msec           18        30 msec
   6        -35 msec           19        35 msec
   7        -30 msec           20        40 msec
   8        -25 msec           21        45 msec
   9        -20 msec           22        50 msec
  10        -10 msec           23        60 msec
  11         -5 msec           24*       70 msec
  12          0 msec

* denotes stimulus used as adaptor

Procedure



Each subject participated in four experimental sessions. The first session 
consisted solely of an initial identification tape, in which each stimulus was 
presented 16 times in random order. The stimuli were presented in blocks of 15, 
with three seconds separating the presentation of each stimulus. Ten seconds 
separated the presentation of each block. Each of the adaptation sessions 
started with a short identification tape which presented each stimulus in the
continuum eight times in random order. Thus, a total of 40 responses for each
stimulus was obtained in an unadapted condition. Under each adaptive condition the
listeners were exposed to 60 presentations of the adapting stimulus (with an 
interstimulus interval of 30 msec). After the period of adaptation the subjects 
were asked to identify five stimuli. Each stimulus in the continuum was pre- 
sented eight times, in random order, for such postadaptation identification. 
Subjects responded with the three labial stops written in Thai orthography. 

Results 

The unadapted boundaries for both the boundary between the prevoiced and 
voiceless unaspirated stops, and the boundary between the voiceless unaspirated 
and voiceless aspirate stops were extrapolated for each subject from the pooled 
identification responses from all sessions. The boundary was defined as that 
point on the stimulus scale which would, by extrapolation, receive 50 percent 
responses from either category involved. The boundaries for the adapted re- 
sponses were estimated in the same manner. The boundary shifts are the differ- 
ences between the unadapted boundary and the adapted boundary. The results are 
displayed in Tables 2 and 3.

TABLE 2: Thai subjects [b]-[p] boundary.

           Original    Shift after        Shift after       Shift after
Subject    boundary    [b] adaptation*    [p] adaptation    [ph] adaptation
   1       -25 msec         -8                 +1                +2
   2       -19 msec         -2                 +2                +1
   3       -18 msec        -14                +10                -7
   4       -23 msec        -11                 +1
   5       -24 msec         -9                 +3                -4

*significant boundary shifts

TABLE 3: Thai subjects [p]-[ph] boundary.

           Original    Shift after       Shift after        Shift after
Subject    boundary    [b] adaptation    [p] adaptation*    [ph] adaptation*
   1       +27 msec         -2                -3                 +6
   2       +22 msec         -3                 —                 +1
   3       +20 msec         -4                -7                +11
   4       +28 msec         -1                -2
   5       +30 msec         -4                -6                +16

*significant boundary shifts

The shifts predicted by the hypothesis of three
feature detectors mediating the perception of the voicing contrasts occurred.
With the [b] adapting stimulus, fewer [b] responses were obtained. With the 
[ph] adapting stimulus, fewer [ph] responses were obtained. The results for the 
[p] adaptation condition are somewhat less definitive. The [p]-[ph] boundary 
shift is fairly robust, and in the predicted direction: fewer [p] responses 
were obtained. The [b]-[p] boundary, on the other hand, is small. However, one 
subject who generally produced relatively large boundary shifts, did respond 
with a large boundary shift. Except for the [b]-[p] boundary shift after LpJ 
adaptation, all these boundary shifts are significant (by a t test for two re- 
lated groups, p < .05). Neither the [b]-[p] boundary after [p] adaptation nor 
the [p]-[ph] boundary after [b] adaptation are significant. 
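
The boundary definition used above (the point on the stimulus scale that would receive 50 percent responses from either category) can be illustrated with a minimal sketch. It assumes straight-line interpolation between adjacent stimuli, which the text does not specify; the function name and the identification percentages are hypothetical and are not data from the experiment. The sign convention (adapted minus unadapted) is inferred from the tables.

```python
def boundary_50(vot_ms, pct_first_category):
    """Return the VOT (msec) at which responses to the first category cross
    50 percent, using linear interpolation between neighbouring stimuli."""
    points = list(zip(vot_ms, pct_first_category))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        crosses = (y0 - 50.0) * (y1 - 50.0) <= 0
        if crosses and y0 != y1:
            return x0 + (50.0 - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("responses never cross 50 percent")


# Hypothetical percent-[b] identification functions (not from the study).
vot = [-40, -30, -20, -10, 0]
unadapted = [100, 90, 40, 5, 0]
adapted_to_b = [100, 70, 20, 0, 0]

unadapted_boundary = boundary_50(vot, unadapted)
adapted_boundary = boundary_50(vot, adapted_to_b)
# Assumed sign convention: adapted boundary minus unadapted boundary.
shift = adapted_boundary - unadapted_boundary
print(unadapted_boundary, adapted_boundary, shift)
```

With these invented values, fewer [b] responses after [b] adaptation move the boundary toward more negative VOT, giving a negative shift of the kind listed in Table 2.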

EXPERIMENT II 

Subjects 

Four native American English speakers served as subjects.

Stimuli

The same stimuli were used in this experiment as were used in Experiment I.

Procedure

The same procedure was followed as was followed in Experiment I, except 
that these subjects responded with only two answers—voiced stops or voiceless 
aspirated stops. 

Results 

The data obtained in Experiment II were analyzed in the same manner as were 
the data from Experiment I. When subjects were adapted to the prevoiced stimu- 
lus fewer voiced responses were given. When subjects were adapted to the 
voiceless unaspirated stimulus, which they categorized as voiced, again fewer 
voiced responses were given. When subjects were adapted to the voiceless aspirated
stimulus, fewer voiceless responses were obtained. All conditions produced
significant boundary shifts (by a t test for two related groups, p < .03).
These results are displayed in Table 4. 





TABLE 4: English subjects [b]-[p] boundary.

           Original     Shift after        Shift after        Shift after
Subject    boundary     [b] adaptation*    [p] adaptation*    [ph] adaptation*

   1       +15 msec         -4                 -5                 +8
   2       +15 msec         -8                 -2                +10
   3       +18 msec         -8                 -5                 +9
   4       +10 msec        -12                 -7                +14

*significant boundary shifts
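
A t test for two related groups is equivalent to a one-sample t test on the per-subject differences, and the shifts in Table 4 are such differences. The following sketch recomputes t from the rounded shifts as printed in the table. It is an approximate check under that assumption, not a reproduction of the authors' calculation; the function name is hypothetical.

```python
import math


def related_groups_t(differences):
    """One-sample t on paired differences; returns (t, degrees of freedom)."""
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    return mean / math.sqrt(variance / n), n - 1


# Boundary shifts (msec) as printed in Table 4, one value per subject.
table4_shifts = {
    "[b]": [-4, -8, -8, -12],
    "[p]": [-5, -2, -5, -7],
    "[ph]": [8, 10, 9, 14],
}

T_CRITICAL_DF3 = 3.182  # two-tailed .05 critical value of t for 3 df

results = {name: related_groups_t(d) for name, d in table4_shifts.items()}
for name, (t_value, df) in results.items():
    print(name, round(t_value, 2), df, abs(t_value) > T_CRITICAL_DF3)
```

On the rounded table values, all three statistics exceed the two-tailed .05 criterion for 3 degrees of freedom, in line with the asterisks on all three adaptation columns.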
Discussion



The results of Experiment I suggest that Eimas and Corbit's (1973) original 
assertion — that two phonetic feature detectors mediate the perception of voicing 
distinctions — ought to be amplified by the addition of a third detector sensitive 
to negative VOT values. By this hypothesis one detector would be sensitive 
primarily to voiceless aspirated cues, a second would be sensitive to voiceless 
unaspirated cues, and the third to cues of prevoicing. The two boundaries between
these three voicing distinctions would occur at those VOT values to which two of 
the feature detectors were equally responsive. 

A somewhat surprising result of Experiment I was that the voiceless aspir- 
ate detector was more resistant to adaptation than the other two detectors. 
Recall that in Eimas and Corbit's (1973) experiment, and also in Eimas, Cooper, 
and Corbit's (1973) findings, the voiceless detector was more susceptible to 
adaptation than the voiced detector. Similarly, for the English speakers of 
Experiment II, adaptation of the voiceless detector produced a more robust 
boundary shift than adaptation of the voiced detector. A possible explanation 
for this discrepancy is that adaptation takes place at both the auditory and the 
phonetic level. In a set of two experiments, the first involving place of ar- 
ticulation and the second voicing, Tartter and Eimas (1975) found that the 
greater the acoustic overlap between the adapting stimulus and the test continuum,
the greater the adaptation effect. For example, the addition of first-formant,
or steady-state, information to adapting stimuli that contained all the
relevant place of articulation information produced a substantially larger
boundary shift than was obtained by an adapting stimulus lacking this first formant.
This fact defies the acoustic theorist. If the existence of some higher
level feature is accepted, however, a clear explanation is possible: after 
adapting to a complete stimulus, with all three formants present, all three of 
the auditory feature detectors will have been fatigued. In contrast, after 
adapting to a stimulus lacking the first formant, only the F2 and F3 detectors 
will have been adapted. Ades (1976), however, objects to this proliferation of 
levels of detectors, saying that "In general, two strengths of adaptation do not 
necessarily indicate two levels of adaptation: it could be that there is just 
one level, more engaged by the full syllable than by parts of it."¹ Obviously
this is not a resolved issue, and the present experimentation does not help in 
its resolution. In light of these claims, consider again the present discrep- 
ancy. 

Although adequate information is present in the stimuli used to allow the 
voicing distinctions to be perceived (according to Tartter and Eimas's explanation),
some information present in natural speech is not present. The VOT information
present in the stimuli is picked up by low-level detectors sensitive
to certain aspects of voicing distinctions, which yield their output to higher 
level voicing detectors. These higher level detectors fail to receive input 
from low-level detectors sensitive to other acoustic features cueing voicing
distinction in natural speech. The lack of adaptation of these low-level detectors
accounts for the so-called "resistance" to adaptation in Thai and English
subjects. That English speakers and Thai speakers are resistant to adaptation
by different adaptors is due to differences in the production of voicing dis- 
tinctions in the two languages. 



¹Ades, A. E. (1976) Adapting the property detectors for speech perception;
preprint sent to author, p. 30.
Ades's viewpoint allows an almost identical explanation, differing primarily
in terminology. Here again the synthetic stimuli used in these experiments
lack some acoustic information present in natural speech, for example, varia- 
tions in release-burst intensity. The lack of relevant information in certain 
stimuli would decrease the strength of adaptation. Furthermore, the contrast in 
strengths of adaptation in Thai and English subjects is due to differences in 
the production of voicing distinctions in the two languages.

At any rate, it is apparent that linguistic environment has a substantial
effect on the development of feature detectors. First, as discussed above, the
discrepancy of degree of adaptation in the Thai and English subjects indicates
differences in the perception of the same acoustic stimuli by subjects from dif- 
fering linguistic backgrounds. Second, the present two experiments demonstrate 
that the perceptual boundaries between voicing categories are different in Thai 
and in English, confirming earlier work (Abramson and Lisker, 1965; Lisker and 
Abramson, 1970). Not only does Thai have one more category of voicing than does
English, but the exact location of the boundary common to the two languages is 
different; in English at approximately +15 msec, as opposed to approximately +25 
msec in Thai. 

This point is also supported by the fact that adaptation with a prevoiced 
stimulus produced a boundary shift between the voiced-voiceless categories of 
English-speaking subjects, but no boundary shift between the voiceless cate- 
gories of Thai subjects. This discrepancy is striking evidence that phonetic 
feature detectors are subject to effects of language detectors. 

The end result is clear: for English-speaking subjects, stimuli — that 
through different linguistic experience would be perceived as belonging to sepa- 
rate categories — are perceived as belonging to the same category, and when 
serving as adaptive stimuli, produce the same effects. What has happened to the 
hypothesized third detector? Perhaps through lack of stimulation the detector 
has atrophied. The evidence from studies of infants' perception of voicing con- 
trasts supports this view. 

Streeter (1976) investigated the discrimination of VOT by infants from a 
linguistic environment that distinguishes between prevoiced and voiceless unaspirate,
but not voiceless aspirated, stops. She found that these infants discriminated
both the prevoiced/voiceless unaspirated distinction and the voiceless
unaspirate/voiceless aspirated distinction. Lasky, Syrdal-Lasky, and Klein
(1975), studying infants born to Spanish-speaking parents, also found three
categories of discrimination comparable to those Streeter found. And yet the
single-phoneme boundary separating Spanish stops corresponds to neither of the 
boundaries which Lasky et al. found. Apparently infants, like chinchillas (Kuhl 
and Miller, 1975), and adults discriminating among nonspeech stimuli differing 
in relative onset time (Miller, Pastore, Wier, Kelly, and Dooling, 1974; Pisoni, 
1976²), distinguish between three categories characterized by leading, simultaneous,
and lagging temporal events.

A second, alternative explanation for the disappearance of the third detector,
not incompatible with the first, is that the detectors sensitive to



²Pisoni, D. B. (1976) Identification and discrimination of the relative onset
time of two-component tones: Implications for voicing perception in stops;
preprint sent to author.
prevoicing are present but that linguistic experience affects the labeling of
the output of the detectors. This alternative would also account for the varying
locations of the exact voicing boundaries in different languages.

In summary, the present experiments suggest the existence of a phonetic
feature detector sensitive to cues of prevoicing. They also demonstrate that
language learning has a substantial effect on the detectors mediating the perception
of voicing contrasts.
ABSTRACT



According to recent investigations, adult listeners perceive 
rise-time differences in both speech and nonspeech stimuli in a cate- 
gorical manner (Cutting and Rosner, 1974). Adults labeled sawtooth- 
wave stimuli as either plucked or bowed. The present study used the 
high amplitude sucking technique to explore the two-month-old in- 
fant's perception of rise-time differences for sawtooth stimuli. In- 
fants discriminated rise-time differences that marked off the differ- 
ent nonspeech categories but did not discriminate equal differences 
within either category. Thus, the present study shows that infants, 
like adults, can perceive nonspeech stimuli in a categorical manner. 



INTRODUCTION 

Considerable evidence indicates that many speech sounds are perceived categorically.
With these stimuli, subjects are no better at discriminating sounds
than they are at differentially labeling them. This claim is supported by ex- 
perimental findings from a number of different paradigms including: (a) accur- 
acy (Liberman, Cooper, Shankweiler, and Studdert-Kennedy, 1967; Mattingly, 
Liberman, Syrdal, and Halwes, 1971; Pisoni, 1971, 1973); (b) reaction time 
(Pisoni and Tash, 1974); and (c) average evoked potentials (Dorman, 1974). 
These results contrast with those observed for a wide variety of nonspeech 
sounds, varying along such physical continua as frequency, amplitude, and dura- 
tion, for which the subject's ability to discriminate between stimuli far out- 
strips his ability to label them differentially (Miller, 1956). 
There is also a growing body of evidence that shows that human infants are
capable of discriminating speech segments on the basis of minimal phonetic cues. 
To date, infants have displayed an ability to perceive subtle differences in 
voicing (Eimas, Siqueland, Jusczyk, and Vigorito, 1971; Streeter, 1974; Eimas, 
1975b; Lasky, Syrdal-Lasky, and Klein, 1975), place of articulation (Morse, 
1972; Eimas, 1974), initial burst cues (Miller, Morse, and Dorman, 1975), and 
third formant cues for the /ra/-/la/ distinction (Eimas, 1975a). Not only do
infants make fine distinctions between speech sounds, but they do so in a cate- 
gorical manner (for example, they make interphonemic distinctions but not intra- 
phonemic ones). Further, Eimas (1974, 1975b) has shown that infants, like 
adults (Mattingly et al., 1971), perceive certain acoustic cues categorically in 
speech contexts but not in nonspeech contexts. On the basis of these findings, 
Eimas (1975b and elsewhere) has suggested that the actual mechanisms which un- 
derlie the categorical perception of speech may be part of the biological makeup 
of the human infant. 

Thus, speech appears to be perceived in a quite different fashion from non- 
linguistic auditory stimuli. However, several recent developments may require 
us to reexamine the claim that categorical perception is evidence for the dis- 
tinctive nature of speech perception. Categorical perception has now been ob- 
served in a number of instances of nonspeech sounds (Locke and Kellar, 1973; 
Cutting and Rosner, 1974; Cutting, Rosner, and Foard, in press; Miller, Wier, 
Pastore, Kelly, and Dooling, in press). In particular, Cutting and Rosner (1974)
have reported categorical perception for nonspeech sounds varying in rise-time.
They have explored the perception of rise-times in both sawtooth-wave and sine-wave
stimuli (as well as for affricate-fricatives in speech). Adult listeners
usually reported that these nonspeech stimuli sound as though they were produced
by a musical stringed instrument. Sounds with rapid rise-times (less than 40
msec) were perceived as coming from a plucked string, whereas sounds with 
more gradual rise-times (greater than 40 msec) were perceived as being produced 
by a bowed string. The listeners easily identified the stimuli as either 
"pluck" or "bow." Moreover, the perception of these stimuli was categorical. 

In a related study, Cutting, Rosner, and Foard (in press) extended the
findings for the sawtooth-wave stimuli by demonstrating selective adaptation 
effects with them. These effects were similar to those observed with speech 
stimuli (Eimas and Corbit, 1973) both in direction and degree of shift. More- 
over, as in the case of speech stimuli, adaptation shifts for the sawtooth stim- 
uli were greatest when the adapting stimulus shared all dimensions with the test 
continuum. 

Although the claim for the distinctive nature of categorical perception in 
speech has been weakened by these lines of research, there has been no indica- 
tion that infants might exhibit categorical perception for nonspeech sounds. In 
fact, Eimas (1974, 1975b) has reported that two- and three-month-old infants 
tend not to perceive nonspeech cues categorically. However, the cues which 
Eimas studied were acoustic features that adults do not perceive categorically 
(Mattingly et al., 1971; Miyawaki, Strange, Verbrugge, Liberman, Jenkins, and 
Fujimura, 1975). The sawtooth-wave stimuli employed by Cutting and Rosner (1974) 
would seem to be a better choice for such a test. Not only do adults perceive 
these sounds categorically, but rise-time is also an important acoustic cue in 
various contexts. Accordingly, the present study explored the perception of 
rise-time differences in sawtooth-wave stimuli by two-month-old infants. 
METHOD



Procedure 

Each infant was tested in a mobile laboratory. The infants were placed in
a reclining seat which faced a loudspeaker approximately two feet away. Each 
subject sucked on a blind nipple which one of the experimenters held in place. 

The experimental procedure was a modification of the high amplitude sucking 
technique developed by Siqueland and DeLucia (1969). For each infant, the high 
amplitude sucking criterion and the baseline rate of high amplitude nonnutritive
sucking were established before presentation of any stimuli. The criterion for 
high amplitude sucking was adjusted to produce sucking rates of 10 to 20 sucks 
per minute. After a baseline rate was established, the presentation of stimuli 
was made contingent upon the rate of sucking. If the time between criterion
responses was two seconds or more, then each response produced one presentation 
of the stimulus, which had an average duration of 1050 msec, followed by 950 
msec of silence. If the infant produced a burst of high amplitude responses 
within this two-second interval, the timing apparatus was automatically reset 
and the two-second interval began again. 

The criterion for habituation to the first stimulus was a decrement in 
sucking rate of 25 percent or more over two consecutive minutes, compared to the 
rate in the immediately preceding minute. At this point, the auditory stimula- 
tion was changed without interruption by switching channels on the tape recorder. 
For infants in the experimental conditions, the change resulted in the presenta- 
tion of a second acoustically distinct stimulus. For the infants in the control 
condition, the channels on the tape recorder were switched but no acoustic 
change was made. The postshift period lasted for four minutes. The infants' 
sensitivity to the change in the auditory stimulation was inferred from compari- 
sons of the response rates of subjects in the experimental and control condi- 
tions during the postshift period. 
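
The habituation criterion described above can be sketched as follows. This is one reading of the criterion, stated here as an assumption: each of two consecutive minutes must fall at least 25 percent below the rate in the minute immediately preceding them. The function name and the minute-by-minute rates are hypothetical.

```python
def habituation_minute(sucks_per_minute, decrement=0.25):
    """Return the index of the second of two consecutive minutes whose rates
    are both at least `decrement` below the minute just before them, or None
    if the criterion is never met."""
    for i in range(2, len(sucks_per_minute)):
        reference = sucks_per_minute[i - 2]
        ceiling = reference * (1.0 - decrement)
        if (reference > 0
                and sucks_per_minute[i - 1] <= ceiling
                and sucks_per_minute[i] <= ceiling):
            return i
    return None


# Hypothetical high-amplitude sucking rates for successive minutes.
rates = [22, 35, 41, 38, 27, 25]
shift_after = habituation_minute(rates)
print(shift_after)
```

In this invented record the criterion is met at the sixth minute (index 5), after which the stimulus shift would be made.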

Stimuli 

The stimuli were sawtooth waves generated on the Moog synthesizer at the 
Presser Electronic Studio of the University of Pennsylvania. The four stimuli 
were synthesized at 440 Hz and differed solely in their onset characteristics. 
Amplitude envelopes reached maximum in 0, 30, 60, and 90 msec after onset. By 
0 msec rise-time, we mean that a stimulus reached maximum amplitude in one-fourth
of a period. Previous research by Cutting and Rosner (1974) indicated
that adults easily label the rapid onset (0 and 30 msec) sounds as "plucks." 
The more gradual onset stimuli (60 and 90 msec) were easily labeled as "bows." 
The durations of the four nonspeech stimuli were 1020, 1050, 1080, 1110 msec, 
varying according to rise-time. The decay period of each stimulus was 1020 
msec. 
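
The relation between rise-time, decay period, and total duration can be illustrated with a digital sketch of the four stimuli. The original sounds were produced on an analog synthesizer; the sampling rate, the linear shape of the rise and decay, and all names below are assumptions made only for illustration. The quarter-period onset of the 0 msec stimulus is not modeled.

```python
SAMPLE_RATE = 20000   # samples per second; an assumption, not from the text
FREQUENCY_HZ = 440.0  # fundamental frequency given in the text
DECAY_MS = 1020       # decay period given in the text


def sawtooth_stimulus(rise_ms):
    """Sawtooth wave with an assumed linear rise over `rise_ms` followed by
    an assumed linear decay over DECAY_MS."""
    n_rise = SAMPLE_RATE * rise_ms // 1000
    n_decay = SAMPLE_RATE * DECAY_MS // 1000
    samples = []
    for i in range(n_rise + n_decay):
        phase = (i * FREQUENCY_HZ / SAMPLE_RATE) % 1.0
        wave = 2.0 * phase - 1.0          # sawtooth in the range [-1, 1)
        if i < n_rise:
            envelope = i / n_rise         # rising portion
        else:
            envelope = 1.0 - (i - n_rise) / n_decay  # decaying portion
        samples.append(envelope * wave)
    return samples


durations_ms = {
    rise: 1000 * len(sawtooth_stimulus(rise)) / SAMPLE_RATE
    for rise in (0, 30, 60, 90)
}
print(durations_ms)
```

Under these assumptions the total durations come out as the rise-time plus the 1020 msec decay, matching the four durations listed above.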

All the stimuli were prerecorded on three 30-minute audio tapes for presentation
to the subjects. Tape #1 (pluck-pluck) was composed of 0 msec rise-time
stimuli on channel A and of 30 msec rise-time stimuli on channel B. Tape #2
(bow-bow) was composed of 60 msec rise-time stimuli on channel A and of 90 msec
rise-time stimuli on channel B. Tape #3 (pluck-bow) was composed of 30 msec
stimuli on channel A and of 60 msec stimuli on channel B.
Design

Table 1 shows the within-subjects design for the present experiment. All
subjects were seen for two experimental sessions. (Mean interval between sessions
was 8 days; range was 5 to 14 days.) In one session, all subjects heard
the pluck-bow tape. The other session differentiated the three groups of subjects.
Subjects in Group 1 heard the pluck-pluck tape. Subjects in Group 2
heard the bow-bow tape. Subjects in Group 3 were randomly assigned one of the
four rise-time stimuli for the entire session (the NO CHANGE condition). The
order of sessions and the order of stimuli within a session were each counterbalanced.

TABLE 1: Design.

                   Session A     Session B

Group 1 (n=6)      Pluck-bow     Pluck-pluck
Group 2 (n=6)      Pluck-bow     Bow-bow
Group 3 (n=6)      Pluck-bow     NO CHANGE

Apparatus 

A blind nipple was connected to a Grass PT5 volumetric pressure transducer 
which was coupled in turn to a Beckman Type RS Dynograph. An integrator-coupler 
provided a digital output of criterial high amplitude sucking responses. Addi- 
tional equipment included a 4-track Hitachi tape recorder with speakers, a 
Hunter digital timer, two relays, and a counter. Each criterion response activated
the digital timer for a two-second period or restarted the period.
Auditory stimulation at a level of 72 ± 2 dB SPL was available to the infant 
whenever the timer was in an active state. 

Subjects 

The subjects were 18 infants, nine males and nine females. Mean age was 
eight weeks (range: five to ten weeks). In order to obtain complete data on 
18 infants, it was necessary to test 25. Seven infants were dropped from the 
study for the following reasons: two infants fell asleep prior to shift, three
cried excessively prior to shift, and the mothers of two infants were unable to 
keep the second appointment. 

RESULTS 

Figure 1 displays the mean number of high amplitude sucking responses as a 
function of minutes and experimental groups. For purposes of statistical com- 
parisons, we examined each subject's rate of high-amplitude sucking during five 
intervals: baseline minute, third minute before shift, average of minutes one
and two before shift, average of minutes one and two after shift, and average of
minutes three and four after shift. Difference scores were calculated for each
subject for the following rate comparisons: (1) acquisition of the sucking
response: third minute before shift less baseline; (2) habituation: third
minute before shift less average of last two minutes before shift; (3) dishabituation:
average of first two minutes after shift less average of last two
minutes before shift; (4) dishabituation during third and fourth minutes:
average of third and fourth minutes after shift less average of last two minutes
before shift; and (5) rehabituation: average of first two minutes after
shift less average of third and fourth minutes after shift.
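
The five difference scores can be written out as a short sketch for a single subject. The function name, argument layout, and rates are hypothetical. Comparison (3) is taken against the last two minutes before shift, following the row labels of Table 2.

```python
def difference_scores(baseline, pre_shift, post_shift):
    """Compute the five rate comparisons for one subject.

    pre_shift:  rates for the minutes before shift, in chronological order
                (at least three minutes).
    post_shift: rates for the four minutes after shift, in order.
    """
    third_before = pre_shift[-3]
    last_two_before = (pre_shift[-2] + pre_shift[-1]) / 2
    first_two_after = (post_shift[0] + post_shift[1]) / 2
    third_fourth_after = (post_shift[2] + post_shift[3]) / 2
    return {
        "acquisition": third_before - baseline,
        "habituation": third_before - last_two_before,
        "dishabituation": first_two_after - last_two_before,
        "late_dishabituation": third_fourth_after - last_two_before,
        "rehabituation": first_two_after - third_fourth_after,
    }


# Hypothetical sucking rates (responses per minute) for one infant.
scores = difference_scores(
    baseline=20,
    pre_shift=[30, 44, 30, 28],
    post_shift=[40, 36, 30, 26],
)
print(scores)
```

Positive acquisition and habituation scores indicate that the response was first acquired and then declined; a positive dishabituation score indicates recovery of sucking after the shift.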

Kruskal-Wallis one-way analyses of variance (Siegel, 1956) were employed
to determine if the data for the pluck-bow sessions could be collapsed across the
three experimental groups. No significant differences were observed between
groups for any of the five comparisons [χ²(2) ranged from 0.37 to 4.10]; accordingly
the data for the pluck-bow sessions were collapsed across groups in
further analyses. Additionally, Kruskal-Wallis tests indicated no differences
[χ²(1) ranged from 0.03 to 0.92] for the bow-bow and pluck-pluck subgroups,
whose data were similarly combined for further treatment.
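
For readers unfamiliar with the Kruskal-Wallis statistic, a minimal sketch of its computation follows. It uses the standard formula without the correction for ties, and the toy scores are invented for illustration. The statistic is referred to the chi-square distribution with one fewer degrees of freedom than the number of groups; the .05 critical values are 5.99 for 2 df and 3.84 for 1 df, so the ranges reported above fall short of significance.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for a list of groups (no tie correction applied)."""
    pooled = [score for group in groups for score in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Toy difference scores, not data from the study.
separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
balanced = kruskal_wallis_h([[1, 5, 9], [2, 6, 7], [3, 4, 8]])
print(separated, balanced)
```

Groups whose ranks are thoroughly interleaved, as in the second toy example, give a statistic near zero, which is the pattern that justifies collapsing across groups.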

Wilcoxon matched-pairs signed-ranks tests (Siegel, 1956) were used to analyze
performance within each type of session. The results of these analyses,
presented in Table 2, indicated that in all sessions subjects acquired the 



TABLE 2: T-values for Wilcoxon matched-pairs signed-ranks test.

                                                 Experimental session

                                                       Pluck-pluck
                                         Pluck-bow     or bow-bow     NO CHANGE
Comparison                               (n=18)        (n=12)         (n=6)

Acquisition: third minute before         0**           0**            0*
  shift versus baseline.

Habituation: third minute before         0**           0**            0*
  shift versus average of last two
  minutes before shift.

Dishabituation: first two minutes        -1**          -19            0*a
  after shift versus last two
  minutes before shift.

Late dishabituation: third and           -52.5         23             4
  fourth minutes after shift versus
  last two minutes before shift.

Rehabituation: first two minutes         -12**         -26            8
  after shift versus third and
  fourth minutes after shift.

** p < .01
*  p < .04
a  indicates reliable decrease in sucking
conditioned high-amplitude sucking response and habituated the response prior to
shift. However, only in the pluck-bow condition did subjects display a reliable
increase in sucking after the shift. Moreover, these subjects showed a reliable 
increase in sucking during the first two minutes after shift followed by a reli- 
able decrease in rate between that period and the next two minutes, thus indi- 
cating rehabituation. By contrast, subjects in the other three conditions 
showed no evidence of any increase in sucking after shift. Subsequent analysis 
of the data for the pluck-pluck, bow-bow, and NO CHANGE sessions by Kruskal-Wallis
tests indicated no reliable differences in the pattern of responding by
subjects in these sessions. Randomization tests on within-subjects data across
conditions confirmed these findings. 
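
The T statistic of the Wilcoxon matched-pairs signed-ranks test can be sketched in a few lines. The conventional procedure is assumed here: zero differences are dropped, absolute differences are ranked with ties averaged, and T is the smaller of the positive and negative rank sums. Table 2 prints some T-values with a sign, which this sketch does not attempt to reproduce. The difference scores are invented.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def wilcoxon_t(differences):
    """Return (T, N): the smaller signed rank sum and the number of
    nonzero differences that entered the ranking."""
    nonzero = [d for d in differences if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    positive = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    negative = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return min(positive, negative), len(nonzero)


# Hypothetical dishabituation difference scores (not the study's data).
t_value, n_used = wilcoxon_t([5, 3, 8, -1, 6, 2, 0])
print(t_value, n_used)
```

A small T means that nearly all subjects changed in the same direction; the obtained value is then compared with tabled critical values for the given N.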

Butterfield and Cairns (1974) have reported that asymmetrical order effects 
are sometimes observed for speech stimuli which cross phonetic boundaries (a 
shift from a voiced to a voiceless stop producing greater dishabituation than 
from voiceless to voiced). We tested for such asymmetries with the present 
stimuli. None were discovered, as Kruskal-Wallis tests for the pluck-bow sessions
yielded no reliable differences [χ²(1) ranged from 0.02 to 1.73] between
the two presentation orders. 

DISCUSSION 

The present data indicate that infants as young as two months of age perceive
rise-time cues in sawtooth-wave stimuli in a categorical manner, as do
adults (Cutting and Rosner, 1974). This constitutes the first demonstration 
that infants perceive acoustic stimuli other than speech in a categorical fash- 
ion. Our results are consistent with those observed for speech stimuli (for 
example, Eimas et al., 1971; Eimas, 1974, 1975a, b), since infants displayed a 
reliable increase in sucking only for stimuli chosen from opposite sides of the 
adult categorical boundary. 

How can we explain the two-month-old's propensity to categorize "plucks"
and "bows"? One relevant result (Cutting and Rosner, 1974) is that rise-time
is a sufficient cue for the categorical perception of [ʃa] and [tʃa] as in
"shop" and "chop." One possible explanation for the present results, then, is 
that the sawtooth-wave stimuli are perceived categorically just because rise- 
time is a salient dimension in speech perception. By one interpretation of this 
linguistic hypothesis, however, every acoustic dimension which is perceived 
categorically in speech should also be perceived categorically in nonspeech 
sounds. Yet, Mattingly et al. (1971) reported that second formant transitions 
which are perceived categorically in speech are not perceived categorically 
when heard in isolation. These results undercut the strong version of the lin- 
guistic hypothesis. An alternative formulation would hold that all dimensions 
perceived categorically in nonspeech sounds also are perceived categorically in 
speech sounds. Locke and Kellar's (1973) report of categorical perception of 
triadic chords seems to contradict this view. Acceptance of this weak version of 
the hypothesis also leaves open the question of why some dimensions and not 
others are perceived categorically outside of speech. 

A second hypothesis can account for our results. This acoustic hypothesis 
argues that the categorical perception of [ʃa] and [tʃa] is merely a special
case of the categorical perception of the acoustic dimension of rise-time. Indeed,
many other nonspeech stimuli may also be perceived categorically. According
to this view, the categorical perception of speech sounds is a consequence
of general properties of the auditory system rather than of a special system
devoted entirely to the perception of speech. This is supported by a number of
results. For example, Lisker and Abramson (1964), Cooper (1974), and Stevens
and Klatt (1974) have demonstrated that voice-onset-time (VOT) is actually composed
of several acoustic cues. Selective adaptation with the individual acoustic
cues from these dimensions produced boundary shifts along the VOT continuum
(Lisker, 1975). Similarly, Tartter and Eimas (1975) demonstrated that a number
of acoustic cues produced selective adaptation effects for the place-of-articulation
continuum as well as for the VOT continuum. Their investigations led
Tartter and Eimas to conclude that some selective adaptation effects previously
thought explicable only by a phonetic model (for example, Eimas, Cooper, and
Corbit, 1973) can be more simply handled by reference to acoustic features.
Thus, these recent studies tend to show that more and more of the presumably
unique features of human speech can be explained in terms of the acoustic properties
of the sounds. Perhaps the particular combination of information available
for auditory analysis determines the activation of higher level analyzers
which possibly deal only with phonetic information. Thus, the human's tendency
to perceive categorically is not limited to speech sounds. The number and
variety of nonspeech sounds which are perceived categorically remain to be
determined.
Categorical Perception Along an Oral-Nasal Continuum*

Roland Mandler†



ABSTRACT 

Dental and labial nasal consonants were constructed using two 
methods of synthesis, one employing the nasal branch resonances, 
and one the oral branch resonances of the OVE III in simulation of
the period of closure. Oral-nasal continua were generated for both
places (/da/ to /na/ and /ba/ to /ma/) for both methods of syn- 
thesis. Identification and same-different discrimination tests 
from all four resulting sets were administered to thirteen sub- 
jects. Their responses yielded strong evidence for categorical 
perception along the oral-nasal dimension. 

Extensive analysis of the structure of nasal consonants by Fujimura (1962) 
and other researchers has revealed the predominant cue value of two basic com- 
ponents: one, the presence of a low amplitude noise through the period of 
closure, and two, a stoplike transition following the closure. Using a tape- 
splicing technique, Malecot (1956) confirmed that place cues were carried in 
the transition and nasality was cued by the low amplitude noise through the 
period of closure. A study by Liberman, Delattre, Cooper, and Gerstman (1954) 
using synthetic speech employed identical transitions for nasals and oral stops 
to get labeling judgments for continua across place of articulation. Both sets 
yielded good identification. These and other perception experiments suggested 
that listeners perceived such stimuli categorically, that is, distinguished 
stimuli across phonetic categories but not within categories, despite identical
acoustic variations. The present experiment was undertaken to determine
whether categorical perception could be evidenced along an oral-nasal
continuum. 

METHOD 

Stimulus Specifications 

The specifications of the OVE III serial synthesizer at the Haskins 
Laboratories allowed for two methods of construction of nasals from stops: 



*Presented at the 91st meeting of the Acoustical Society of America, 
4 April 1976, Washington, D.C. 

†Also University of Connecticut, Storrs.
the first method, henceforth referred to as the oral branch method, simulates
nasals on the oral branch of the synthesizer by making use of wider bandwidth
settings for the first and second formants through the period of closure; the
second method, henceforth referred to as the nasal branch method, precedes and
overlays oral branch stop transitions with the output of the nasal branch of
the synthesizer, with one variable formant and a number of higher fixed formants.

Nonsense syllables in consonant-vowel configuration were chosen as the
basic stimuli of the experiment. The neutral vowel /a/ was employed for both 
methods, with continua constructed for bilabial (that is, /ma/ to /ba/) and 
alveolar (/na/ to /da/) places of articulation. Figure 1 illustrates the ex- 
treme ends of the bilabial continua for both methods, with shading through the 
portions varied. 



[Figure 1 appears here: two panels, ORAL BRANCH METHOD and NASAL BRANCH METHOD; horizontal axis, time in msec (0 to 500).]



Figure 1: Schematic diagram of extreme nasal bilabial stimuli using both meth- 
ods. Shading indicates portions varied through continua. 



The extreme nasal stimulus of the oral branch method bilabials had the
following structure: an 80-msec period of closure with F1 at 240 Hz, F2 at
1000 Hz, and F3 at 2600 Hz, followed by a 40-msec transition to the vowel
steady-state values of 820 Hz for F1, 1180 Hz for F2, and 2630 Hz for F3. The
vowel duration was 280 msec, resulting in a total stimulus length of 400 msec.
Over the nine stimuli of the continuum, three parameters were varied in equal
steps during the period of closure: oral branch amplitude was varied from 0
through 24 dB relative to the amplitude of the vowel steady-state;
first-formant bandwidth was varied through the range 205 to 70 Hz;
second-formant bandwidth was varied through the range 350 to 80 Hz. The
alveolar set consisted of the same acoustic variations with differences only
in the initial formant frequencies, namely F2 at 2000 Hz and F3 at 2800 Hz.

The extreme nasal stimulus of the nasal branch method bilabials had the
following structure: nasal branch excitation with its lowest resonance at
240 Hz through the 80-msec period of closure and the 40-msec stop transition;
oral branch resonances initiated the transition from values of 240 Hz for F1,
1000 Hz for F2, and 2200 Hz for F3. The schematic does not illustrate the fixed
upper formants of the nasal branch to prevent confusion with the upper for- 
mants of the oral branch. Only nasal branch amplitude was varied in this set, 
through the range 0 dB to -14 dB relative to the amplitude of the vowel steady- 
state. This variation resulted in an eight-member continuum. The alveolar set 
contained the same nasal formants and variations, with initial oral branch for- 
mant values identical to those of the oral branch method alveolar set. 
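
The equal-step construction of the two continua can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `equal_steps` and the dictionary keys are hypothetical names, the assignment of the two bandwidth ranges to the first and second formants is inferred from the description of the oral branch method, and the oral branch amplitude range is taken as printed (0 through 24 dB).

```python
def equal_steps(start, stop, n):
    """Return n values from start to stop, inclusive, in equal steps."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

# Oral branch method: nine stimuli, three parameters varied during closure.
# Assumption: the 205-70 Hz range belongs to F1 and the 350-80 Hz range to F2.
oral = {
    "amplitude_db": equal_steps(0, 24, 9),
    "f1_bandwidth_hz": equal_steps(205, 70, 9),
    "f2_bandwidth_hz": equal_steps(350, 80, 9),
}

# Nasal branch method: eight stimuli, nasal branch amplitude only.
nasal_amplitude_db = equal_steps(0, -14, 8)

# Total stimulus length: closure + transition + vowel, in msec.
total_ms = 80 + 40 + 280

for number, level in enumerate(nasal_amplitude_db, start=1):
    print(f"nasal branch stimulus {number}: {level:+.0f} dB")
```

Under these assumptions the nasal branch continuum moves in 2-dB steps, so its fifth member falls at -8 dB relative to the vowel steady-state.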

Procedure 

Pilot free-choice labeling tests were given to 20 subjects to determine 
that end points represented the nasals and stops intended, and to assure that 
only those two categories were perceived through all continua. 

Final forced-choice identification tests, one randomly arranged test for 
each continuum, consisted of four presentations of each stimulus. Each
presentation contained two samples of the stimulus with a one-second
interstimulus interval. The interval between presentations was four seconds.
Thus, each test set for the oral branch method contained 36 presentations,
while each test set for the nasal branch method contained 32 presentations.
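
As a minimal sketch of how such an identification test order might be assembled, the following Python fragment repeats each stimulus number four times and shuffles the list. The function name `identification_test` and the use of a seeded generator are assumptions made for illustration; the original randomization procedure is not described.

```python
import random

def identification_test(n_stimuli, repeats=4, seed=None):
    """Return a randomly ordered list of stimulus numbers, each repeated `repeats` times."""
    rng = random.Random(seed)  # seeded only so that the example is reproducible
    order = [s for s in range(1, n_stimuli + 1) for _ in range(repeats)]
    rng.shuffle(order)
    return order

oral_test = identification_test(9, seed=0)   # nine-member oral branch continuum
nasal_test = identification_test(8, seed=0)  # eight-member nasal branch continuum
print(len(oral_test), len(nasal_test))  # 36 32
```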

Same-different discrimination tests, one for each of the four continua, 
had the following form: as sames, four randomly arranged pairs of each
stimulus were included; as differents, four randomly arranged pairings of
adjacent stimuli, two in the order AB and two in the order BA, appeared. The
interstimulus interval was 500 msec, and the interval between pairs was four
seconds. The resulting oral branch method test sets consisted of 64 pair
presentations for each place, while the nasal branch method sets consisted of
56 pairs each.

Subjects 

Fifteen paid subjects, all University of Connecticut students, took all 
eight tests. All were native speakers of American English who claimed normal 
hearing ability and phonetic naivety. Eight were right-handed males and seven 
were right-handed females. The test sets were presented binaurally through 
headphones in a soundproofed room. Up to five subjects took the tests at one
time. All four identification tests were presented first; the four
discrimination tests followed.

RESULTS



Figure 2 shows bilabial identification and discrimination results on the
oral branch method for subject DC. This subject was given the tests again
later in order to get more reliable identification and discrimination curves.
Each point on the labeling graphs thus represents eight judgments, while each
discrimination point represents sixteen responses. Standard methods of data
analysis were used.

Figure 2: Oral branch method bilabial results for subject DC.

The labeling crossover occurs in the region of stimulus 6. The
discrimination curve rises from a baseline value of approximately 50 percent
to a peak of 75 percent in the region between stimuli 6 and 7. Thus, stimuli 1
through 5 formed the nasal category, and stimuli 7 through 9 made up the stop
category for this subject. The alveolar data given by DC showed a similar
pattern of agreement between labeling crossover and discrimination peak.
Crossover was at stimulus 6 and the discrimination peak of 82 percent occurred
for the pairing of stimuli 6 and 7.

Figure 3 shows the nasal branch method bilabial results for the same sub- 
ject. The labeling crossover corresponds to stimulus 5 (which had nasal 
amplitude of -8 dB relative to the vowel steady-state amplitude). The dis- 
crimination peak of 89 percent corresponded to the pairing of stimuli 5 and 
6. Again, good agreement is evident between crossover and discrimination peak. 
The alveolar data for this stimulus type showed a labeling crossover between 
stimuli 4 and 5, with a discrimination peak of 89 percent in that same region.

Figure 4 depicts discrimination responses for all four test sets given by
DC. Correspondences across place for peak and baseline data show good
consistency. Notable in the oral branch results is the fact that the slope
toward the peak from the nasal category is greater than that away from the
peak toward the oral-stop category. This suggests that a peculiarity of the
method introduced greater cue value for within-category discrimination of oral
stops than for nasals.

The nasal branch method did not yield consistent correspondences between
peak values across place. The peak for the bilabial set occurred in the
region of stimuli 5 and 6, while the alveolar peak was in the region of
stimuli 4 and 5, despite the fact that the acoustic variations were
identical. Both sets, nonetheless, yielded well-established nasal and stop
categories at either end.

Group data for both stimulus types show essentially the same patterns
of distribution as were evident in this individual's responses. Perceptual
variation across subjects causes a wider region of indecision around
crossovers and wider and somewhat depressed discrimination peaks. What is
apparent in even the group data, however, is the consistent correspondence
between crossovers and discrimination peaks. These correspondences offer
substantial evidence for categorical perception along an oral-nasal continuum.
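
The comparison between labeling crossover and discrimination peak can be made concrete with a small Python sketch. The two helper functions and the percentages below are hypothetical and are meant only to show the form of the comparison; they are not the data of the study, and the crossover is located by simple linear interpolation, which is an assumption about method.

```python
def labeling_crossover(percent_nasal):
    """Interpolated stimulus number at which nasal labeling falls through 50 percent."""
    for i in range(len(percent_nasal) - 1):
        a, b = percent_nasal[i], percent_nasal[i + 1]
        if a >= 50 > b:
            # Stimulus numbers are 1-based; interpolate between stimuli i+1 and i+2.
            return (i + 1) + (a - 50) / (a - b)
    return None

def discrimination_peak(percent_correct):
    """Return the adjacent stimulus pair (first, second) that is discriminated best."""
    best = max(range(len(percent_correct)), key=lambda i: percent_correct[i])
    return best + 1, best + 2

# Hypothetical values for illustration only.
labeling = [100, 100, 100, 95, 85, 50, 10, 0, 0]    # percent "nasal", stimuli 1-9
discrimination = [50, 50, 50, 55, 60, 75, 60, 50]   # percent correct, pairs 1-2 ... 8-9

crossover = labeling_crossover(labeling)
pair = discrimination_peak(discrimination)
agrees = pair[0] <= crossover <= pair[1]
print(crossover, pair, agrees)
```

Perception is described as categorical when, as in this made-up case, the crossover lies within the stimulus pair that is discriminated best.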

A relevant problem brought out by the results of this experiment concerns 
a hypothesis of Fujisaki and Kawashima (1968, 1969, 1970). They proposed
that consonants are perceived categorically, and vowels less so, due to the 
acoustically transient character of the consonantal cues. The continua of 
this study, however, relied on an 80-msec steady-state noise with varying 
amplitude for cue value. The categorical perception observed here therefore 
cannot be attributed to transience of the distinctive oral-nasal cue. 

Stop Voicing Production: Natural Outputs and Synthesized Inputs*
Leigh Lisker 



ABSTRACT 

In recent years the initial stop consonants of English have been 
subjected to the relentless attention of speech researchers concerned 
with the basis for their division into the two category-sets /p,t,k/ 
and /b,d,g/. The data which suggest the several hypotheses currently 
entertained have two main sources: natural production of "normal" 
speakers of the language operating in "normal" fashion, and the re- 
sponses of persons of like description to synthesized speech stimuli 
designed to measure the effect of systematic variation of selected 
acoustic features. The responses required of subjects in tests of 
synthetic speech can hardly be considered representative of their 
behavior in responding to natural speech; what the testing of syn- 
thetic speech demonstrates is the capability of the perceptual system 
to deal with the features selected for study, not that this capability 
is necessarily exploited in the perception of speech. Two kinds of 
information of relevance to the question of speech cues have not been 
collected: (1) the extent to which features having potential cue 
value show variations in natural speech that match the magnitudes 
tested in synthesis, and (2) the extent to which features for which 
distinctive function is claimed may be subjected to experimental ma- 
nipulation by skilled speakers without significantly reducing intel- 
ligibility. Experimental data are presented to indicate that at least 
one acoustic feature that affects stop-voicing perception in synthetic 
speech is of marginal or less importance in the perception of natural 
speech. 

Let us consider the hypothesis that listeners, in making a stop voicing 
decision, attend primarily to that part of the signal produced by a stop-vowel 
articulation which comes immediately after onset of voicing, and that they
report /b,d,g/ if they detect a first-formant frequency shift and otherwise
report /p,t,k/.

The experiments from which this hypothesis derives were reported by Stevens
and Klatt (1974), who established that the voice onset time (VOT) boundary
(Lisker and Abramson, 1964) between synthetic /da/ and /ta/ syllables shifts
with increase in the duration of the formant transitions, and that when the
interval of voiced F1 transition is decreased to a point where it may no
longer be detected by the listener, the reported stop is /t/. A replication
and extension of their experiment showing the shift in VOT boundary with tran- 
sition duration was reported to the Acoustical Society of America by Lisker, 
Liberman, Erickson, and Dechovitz (1975), and those data, shown in Figure 1, 
are in close agreement with the finding of Stevens and Klatt, so far as demon- 
strating that the VOT boundary is not fixed. 

However, as emerges more clearly in the lower display of Figure 2, it appears
that increasing the transition duration effects an even more drastic shift in
the boundary duration of the voiced F1 transition (VTD) than in VOT. Moreover,
the patterns in both the M.I.T. and the Haskins experiments just referred to
might be as well described by reference to at least two other measures, namely
the F1 onset frequency and the frequency range of transition. To be sure, of
the 20 subjects who provided these data, there was one whose judgments make
better sense if described as responses to voiced F1 transition duration, but
for the subjects as a group, VOT seems to have been a more compelling cue.

The data of Figure 1 are replotted in another way in Figure 3 to answer 
the following question: How effective is varying overall transition duration 
(or slope), and thereby altering VTD for fixed VOT values, as a factor affecting 
stop labeling behavior? From this display we see that judgments shift category 
with increasing transition duration for only three values of VOT, that is, +25, 
+35, and +45 msec. No transition durations yield more than a negligible number 
of /ta/ judgments for VOT less than +25 msec, or /da/ judgments for VOT greater 
than +45 msec. For the three curves of Figure 3 which cross the 50 percent line, 
the VTD values at the crossover are respectively about 10, 30, and 50 msec, and 
this shift in VTD boundary value is just double the amount of shift in VOT 
boundary placement. 

It should be remembered in connection with this comparison of VOT and VTD 
measures, that they are not independent for any particular stimulus, since their 
sum is, of course, simply the combined durations of burst and transition. VTD 
is just another measure of voice onset timing, differing from VOT only in that 
it takes as the temporal reference point the onset of the steady-state vowel
instead of the burst. The fact that this point is much less reliably determined
in spectrograms of natural speech than in synthetic speech patterns designed
with this measure in mind does not make implausible the hypothesis that a
detector which registers the presence of voiced F1 transition provides the
basis for the stop-voicing decision; it does make VTD a rather less convenient
measure
to apply in the acoustic analysis of stop-vowel sequences. 
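
The dependence between the two measures can be written out as simple arithmetic. The Python sketch below uses the approximate crossover values quoted above for the three curves of Figure 3; the function name, the flooring at zero, and the derived totals are illustrative assumptions rather than values reported in the study.

```python
def voiced_transition_duration(total_ms, vot_ms):
    """VTD = (combined burst and transition duration) - VOT.

    Flooring at zero is an assumption: once voicing begins after the
    transition has ended, no voiced transition remains.
    """
    return max(total_ms - vot_ms, 0)

vot_crossovers = [25, 35, 45]   # msec, the three VOT values whose curves cross 50 percent
vtd_crossovers = [10, 30, 50]   # msec, approximate VTD at each crossover

# Implied burst-plus-transition durations at the crossovers (derived, not reported).
totals = [vot + vtd for vot, vtd in zip(vot_crossovers, vtd_crossovers)]

vot_shift = vot_crossovers[-1] - vot_crossovers[0]
vtd_shift = vtd_crossovers[-1] - vtd_crossovers[0]
print(totals, vtd_shift / vot_shift)  # [35, 65, 95] 2.0
```

The ratio of 2.0 corresponds to the statement that the shift in VTD boundary value is double the shift in VOT boundary placement.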

However, there are other questions with respect to this hypothesis when
we consider some other experimental data. If the transition detector fails to
sense a voiced F1 transition, we should regularly obtain a /p,t,k/ judgment;
when a stimulus has a transition which under some circumstances provokes
/b,d,g/ responses, we should expect it regularly to trigger the detector,
barring possibly the special circumstance of fatiguing that is alleged to
explain the adaptation effect (Eimas and Corbit, 1973). In Figure 4 we have
labeling responses to stimuli derived from naturally produced syllables cut
and recombined by an electronic segmentation procedure (Cooper and Mattingly,
1969). The upper curves give percent /ka/ judgments to two stimulus sets. In
one, the VOT of a syllable /ka/ was varied by reducing to varying degrees the
duration of the voiceless aspiration, preserving both the original burst and
the voiced portion of the syllable; in the other, the same operation was
performed on the same /k/ aspiration, but the voiced portion of the stimuli
was derived from a spoken /ga/ by deleting the /g/ release transient. Both
stimulus sets yielded, for appropriate VOT values, both /ka/ and /ga/
judgments, with a difference in crossover values of about 15 msec. It is true
that the first set produced, to judge from the curves, rather less convincing
/ga/ syllables than the second did /ka/ syllables, but despite the absence of
a /g/ voiced transition in the first stimulus set, more than half the
responses reflected the presence of the short VOT rather than the absence of
/g/ transition. Similar operations were used to obtain the stimuli for a
second experiment in which subjects were asked to make word rather than
phoneme identifications. Here too the absence of /g/ transition did not block
"gold" responses to stimuli with VOT values of less than +30 msec, nor did the
presence of /g/ transition prevent "cold" judgments for VOT greater than about
+42 msec.

Transition Duration and VOT: /da/ vs /ta/

/da/-/ta/ VOT Crossover and Transition Duration

The data of Figure 1 are represented in the four curves shown. For
the transition durations tested twice, the curves show overall mean
values; the short vertical lines indicate the magnitude of the
differences in the means of Test 1 and Test 2.

Figure 3: The data of Figure 1 are represented here to show the effect of
varying transition duration on the /da/-/ta/ labelings, with VOT as the
parameter.

Figure 4: Labeling data obtained with stimuli derived from natural productions
of the nonsense syllables /ka/ and /ga/ and the isolated words cold
and gold. These syllables were digitalized for computer storage and
editing to produce various combinations of initial and final segments
of the original signals. The curves in the upper graph show labelings
of stimuli composed (1) by varying amounts of the voiceless onset of
/ka/ combined with the voiced residue of the same syllable, and (2) by
combining these same voiceless /ka/ onsets with the voiced residue of
/ga/ obtained by deleting the /g/-burst. The same operations on the
monosyllabic words yielded the labelings given in the lower graph.
Twelve Ss gave ten responses each to each of 23 stimuli derived from
/ka,ga/ and 28 stimuli from cold-gold.

For the last experiments to be reported we return to pure synthesis. In 
the first of these experiments, a stimulus set was generated whose end points 
are illustrated by the schematic spectrograms on the left and upper right in 
Figure 5. The first formant, transition and all, was retracted by varying 
amounts with a maximum delay of 50 msec after onset of the upper formants. The 
F1 voiced transition detector should fire equally for any one of the set to
produce a /b/ response. The labeling curves of the upper data display show that
judgments shifted from /ba/ to /pla/, with an intermediate zone in which both
/pla/ and /bla/ were reported. When VOT exceeds +35 msec, it appears that the
presence of the buzz-excited F1 transition is interpreted, not entirely as a
/b/ cue, but as a cue also to the presence of an additional phonetic segment
preceding the vowel. If /ba/ and /bla/ responses are summed, a /b/-/p/ boundary
can be located at about VOT = +40 msec. Recalling that for patterns
incorporating F1 cutback of more orthodox type (Liberman, Delattre, and Cooper,
1958), the boundary value is generally placed near VOT = +25 msec, we find it
interesting that the effect of preserving the F1 transition intact is equal to
the VOT boundary difference attributable to presence vs. absence of the /g/
voiced transition in the experiments involving manipulation of naturally
produced syllables.

In the two remaining experiments represented in Figure 5, the stimulus sets 
also contained the left pattern at one extreme, with one of the two lower
patterns on the right at the other. In neither of these sets is there any hiss
excitation, and no /pa/ or /pla/ responses were elicited. For the set with F1
retraction, /b/ responses were registered for amounts of retraction up to a
magnitude of 50 msec lag behind the voiced upper formants, at which point
responses shifted to /bla/. In the final stimulus set tested there was no F1
cutback, but only a variable delay in shifting F1 frequency from its low onset
value to the steady-state value for the vowel; in this case a shift from /ba/
to /bla/ occurred when the transition was delayed about 25 msec relative to the
higher formants. If the same feature detector said to operate in the /b,d,g/ 
vs. /p,t,k/ decision is also at work here, it seems that the phonetic inter- 
pretation of its output is not independent of the temporal relation between 
the activating feature and the other acoustic properties which signal stop ar- 
ticulation. 



[Figure 5 panel title: VOT vs First Formant Transition Timing; horizontal
axis: First Formant Transition Lag (msec).]

Figure 5: Labeling responses of 10 Ss (8 trials) to three sets of test patterns,
all having as variable the timing of first-formant transition. The
upper graph shows responses to a stimulus set in which the signal
preceding F1-transition onset was hiss-excited; the mid graph gives data
for a set in which only buzz excitation was present; the lower display
is of data derived from a set in which the first-formant onset was
simultaneous with that of the upper formants, and the first formant
was maintained at the onset frequency until onset of the transition.

If variability in VOT boundary location observed in the data from
experiments in synthesis means that the voicing decision depends on features
in addition to VOT, this by no means implies that some other more stable
feature must turn up. Speech being what it is in the temporal dimension
generally, it is not totally unexpected that VOT resists any very simple
description in its perceptual aspects. It is, I think, also worth mentioning
that whereas experimentally determined cue value of VOT and the boundary
values of that feature are both consistent with measurements on natural
speech, the same cannot be said for transition duration as a significant
variable. There is no evidence so far that natural speech exhibits a matching
variation correlated with the linguistic difference. In fact, somewhat oddly,
one well-known study reporting an extensive set of transition duration data
(Lehiste and Peterson, 1961) found consistently shorter durations for /b,d,g/
than for /p,t,k/. In a well-regulated world the reverse relation would allow
an occasional imprecision in voice onset timing to be compensated for by a
longer duration of voiced transition in /b,d,g/ production, or its shorter
duration in /p,t,k/. Demonstrations that features such as fundamental
frequency and transition duration are available as stop voicing cues are not
invalidated by any evidence that they are not provided by natural speech
signals. However, we should be wary of a too ready acceptance of the
Panglossian view that speech productive behavior matches perfectly the
properties of the auditory-phonetic perceptual mechanism. A good enough match,
by definition, "yes." A perfect one? Perhaps yes, but only "perhaps."

Shifts in Vowel Perception as a Function of Speaking Rate*

ABSTRACT

In rapid speech, acoustic analysis reveals that steady-state 
vowel targets characteristically are not reached. Lindblom and 
Studdert-Kennedy (1967) found in an experiment with synthetic speech 
that listeners showed a shift in the boundary between medial vowels 
/ɪ/ and /ʊ/ with variations in the rate and direction of formant
transitions. Apparently, perceivers compensate for simulated artic- 
ulatory undershoot by perceptual overshoot. An experiment with nat- 
ural speech demonstrated shifts in the acoustic criteria listeners 
employed in vowel recognition as a function of perceived rate of 
utterance. Nine American English vowels in /p-p/ environment were 
produced by a panel of 15 talkers in a fixed sentence frame. The 
destressed, rapidly articulated /p-p/ syllables were excised from
the tape recording and assembled into listening tests. Errors on
vowels in the excised syllables averaged 23.8 percent. Errors jumped
to 28.6 percent when point-vowel precursors were introduced, while
presentation of the syllables in their original sentence context
reduced errors to 17.3 percent. The results suggest that sentence
context aids vowel identification by allowing adjustment primarily
to a talker's tempo, rather than to the talker's vocal tract
characteristics.

Acoustic measurements of vowels in continuous speech often show a
deviation of formant frequencies from the steady-state values typical of slow
citation-form speech. Lindblom (1963) characterized this effect as an
"undershoot" of "target" frequencies in rapid speech. He argued that the degree of
undershoot is a systematic function of the talker's tempo; thus, the underlying 
target may be fully specified by the formant contours even though the target 
value is never reached. Lindblom went on to suggest that listeners could com- 
pensate for the undershoot and infer the underlying target if they had informa- 
tion about the tempo of articulation. This information would presumably be 
carried by the formant contours and by syllable duration. Lindblom and Studdert- 
Kennedy (1967) found some support for this idea in a study with synthetic speech. 

In this study, we used natural speech to determine the extent of the
perceptual problem posed by rapid articulation. We were interested in what
information allows listeners to achieve constancy of vowel perception across different
speaking rates. In particular, we wondered whether the formant contours of a 
single syllable are sufficient to specify a talker's tempo, or whether longer 
stretches of speech are necessary. 

Imagine snatching a syllable from running speech and presenting it to a 
listener for identification. It seems reasonable to suppose that the vowel in 
such a syllable would be less identifiable than the same vowel in a syllable 
spoken in citation form; the syllable will be shorter and there may be no region 
approximating a steady state. 

To test this supposition, we asked a panel of fifteen talkers to produce 
vowels at two different tempos: (1) in /p-p/ syllables spoken in citation-form, 
and (2) in /p-p/ syllables spoken in destressed position in the context of a full 
sentence. In the citation-form syllable test, each of the nine English monoph- 
thongs was represented five times, spoken by different talkers. Thus, listeners 
heard a total of 45 /p-vowel-p/ syllables. For the destressed syllable test, 
corresponding /p-p/ syllables for each talker were excised from the carrier 
sentences and assembled into a comparable test series. Separate groups of lis- 
teners heard the two tests. 

The results are shown in Figure 1. On the average, listeners misidentified
17 percent of the vowels in citation-form syllables and 24 percent of the vowels 
in destressed syllables. In a sense, the 24 percent error rate for the excised 
syllables is surprisingly low since the talkers varied from trial to trial, the 
syllables contained little or no steady-state energy, the syllable centers 
deviated from target values, and the syllables were very short in duration. 
Even so, the error rate was significantly greater than that for citation-form 
syllables. 

There are two possible reasons for the increase in errors on destressed 
syllables. The increase may reflect a greater overlap of cross-sectional for- 
mant frequency values for the destressed vowels or it may be a result of mis- 
perceiving the talkers' tempo. 

Figure 1: Mean percent errors in identifying vowels in citation-form /p-p/
syllables and in destressed /p-p/ syllables excised from sentence
context.



An analysis of listeners' errors provides one means of answering this
question. We applied an extension of Luce's Choice Axiom to the confusion
matrices for each condition. The Luce model assumes that response
probabilities are a function of two types of parameters: (1) similarities
between each pair of stimulus categories, and (2) response biases for the
various categories. If the increased errors on destressed syllables were due
primarily to increased spectral overlap, we would expect a widespread increase
in pairwise similarity values. No such increase was found. The major
difference between citation-form and destressed syllables was found in
response biases. Listeners were biased toward hearing the shorter vowel
alternatives: for example, hearing /pæp/ as /pɛp/, /pɑp/ as /pʌp/, and /pup/
as /pʊp/. These bias shifts suggest that listeners treated the excised
syllables as if they had originally been spoken in isolation, that is, as if
they had been spoken more slowly in citation form.
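
A minimal sketch of a choice model with these two kinds of parameters is given below. It uses the common biased-choice form, in which the probability of response j to stimulus i is proportional to the bias for j times the similarity of i to j; whether the extension applied in the study took exactly this form is an assumption. The function name and all numerical values are hypothetical and are not estimates from the data.

```python
def choice_probabilities(similarity, bias):
    """P(response j | stimulus i) = bias[j] * similarity[i][j] / sum_k bias[k] * similarity[i][k]."""
    probs = []
    for row in similarity:
        weighted = [b * s for b, s in zip(bias, row)]
        total = sum(weighted)
        probs.append([w / total for w in weighted])
    return probs

# Hypothetical two-category example: index 0 is a long vowel, index 1 its shorter alternative.
similarity = [[1.0, 0.2],
              [0.2, 1.0]]

neutral = choice_probabilities(similarity, [0.5, 0.5])
short_biased = choice_probabilities(similarity, [0.3, 0.7])

# Probability of reporting the short vowel when the long vowel was presented.
print(round(neutral[0][1], 3), round(short_biased[0][1], 3))
```

With the similarity values held fixed, shifting bias toward the short alternative raises the rate at which the long vowel is reported as short. This is the kind of pattern that separates a change in response bias from a change in spectral overlap.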

Thus, the error pattern suggests that information about a talker's tempo 
is critical in achieving constancy and that the information is not completely
specified at the single syllable level. To make a more direct test of this, 
we prepared two additional listening conditions in which the same destressed 
syllables were embedded in longer stretches of speech. These contexts were 
intended to establish two different rates of articulation. In one condition, 
we preceded each test syllable with the precursors /hi/ha/hu/ spoken at a slow 
rate by the same talker. In the second condition, we presented the syllables 
in their original sentence context: "The little /p-p/'s chair is red." 

Results for the two context conditions are presented in Figure 2. The 
three bars on the right depict average error rates for vowels in the destressed 
syllables. Following the point-vowel precursors, errors rose significantly to 
29 percent, compared to 24 percent errors for the syllables heard in isolation. 
In contrast, errors dropped significantly to 17 percent when listeners heard 
the syllables embedded in their original sentence context. 

Figure 2: Mean percent errors in identifying vowels under four conditions:
citation-form /p-p/ syllables, destressed /p-p/ syllables excised
from sentence context, the excised syllables when preceded by
point-vowel precursors, and the destressed syllables embedded in their
original sentence context.

An analysis of listeners' errors again proves very instructive. The dominant
effect of the point-vowel precursors was to enhance the pattern of errors found
when destressed syllables were heard in isolation. Response biases toward /pɛp/
and /pʌp/ were even larger than before. Apparently listeners treated the test
syllables as if they had been spoken slowly, like the precursors, in citation
form. Thus, the mismatch between perceived and actual tempo was even greater
than it had been for the isolated syllables.

When listeners heard the test syllables embedded in sentence context, there 
were no major response biases of the kind found for isolated syllables. The
biases toward /pɛp/, /pʌp/, and /pʊp/ were substantially smaller in sentence
context, and as a consequence, there were fewer errors for /pæp/, /pɑp/, /pɔp/,
and /pup/. The original sentence context apparently contained sufficient
information to specify tempo accurately and to preclude the kinds of errors we found
in the other two conditions. 

It is interesting to note that the error rate for destressed vowels in 
sentence context was very close to the 17 percent rate for vowels in citation- 
form syllables; the difference between the two conditions was not significant. 
This suggests that a vowel will be identifiable to the same degree whenever 
the full natural utterance is available to define the tempo. In the case of 
short sentences, the whole sentence is probably the natural unit of articulation. 
In the case of citation-form syllables, the syllable is a self-contained unit 
of articulation. There seems to be a stable level of identifiability when the
full natural unit is available to the listener. Failure to reach steady-state 
target frequencies does not necessarily make a syllable more ambiguous. If a 
listener is tuned to the ongoing tempo, a short destressed syllable is as fully 
determinate as a citation-form syllable. 

The results for the two context conditions raise a further possibility: 
a carrier sentence may aid identification more by defining tempo and stress 
than by defining the spectral range for a given talker. In the point-vowel 
precursor test, the information about rate of utterance was of greater
significance for perception than the range of spectral values provided by the
precursor string. Some researchers have proposed that experience with a talker's
point vowels should reduce the ambiguity of subsequent utterances (cf. Lieberman,
1973). In the present study, at least, misinformation about tempo clearly out- 
weighed any helpful information to be gained from exposure to the point vowels. 

A similar conclusion seems appropriate for the sentence context condition: 
prosodic information was of greater perceptual significance than any available 
information about the talker's spectral range. As before, listeners' errors
provide a useful means for distinguishing these alternatives. If a carrier 
sentence mainly adjusts listeners to a talker's spectral range, we would expect 
extensive reductions in vowel similarities (specifically, reductions in the 
ambiguities due to talker differences). On the other hand, if the sentence 
mainly adjusts listeners to the talker's rate of speech, we would expect changes 
in the response biases for short and long vowel alternatives — and this is what 
we observed. Thus, the identification of vowels in sentence context was more 
sensitive to the transformation produced by tempo and stress than to the
transformation produced by varying talkers.

In general, these results point to the importance of dynamic properties of
speech in the perception of vowels. The effects of prosody on the perception
of phonemic segments deserve fuller exploration.